2025-05-07T20:22:34.9159694Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9165976Z Runner name: 'i-0a11e2b4e0c9387f6'
2025-05-07T20:22:34.9166904Z Machine name: 'ip-10-0-64-8'
2025-05-07T20:22:34.9169577Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9171841Z Contents: read
2025-05-07T20:22:34.9172433Z Metadata: read
2025-05-07T20:22:34.9172917Z Packages: read
2025-05-07T20:22:34.9173406Z ##[endgroup]
2025-05-07T20:22:34.9175329Z Secret source: None
2025-05-07T20:22:34.9175943Z Prepare workflow directory
2025-05-07T20:22:35.7679180Z Prepare all required actions
2025-05-07T20:22:35.7720774Z Getting action download info
2025-05-07T20:22:35.9704478Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:36.2235513Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.5884248Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:38.1730569Z Getting action download info
2025-05-07T20:22:38.2759756Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:38.5065559Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.8.0, 12.6.3, clang)
2025-05-07T20:22:38.5567912Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.5673714Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.5685181Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.5685834Z ##[endgroup]
2025-05-07T20:22:39.5929575Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.5929976Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.5930274Z AMI Name: unknown
2025-05-07T20:22:39.5964854Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.9880249Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.9880541Z with:
2025-05-07T20:22:44.9880753Z   submodules: true
2025-05-07T20:22:44.9880992Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.9881371Z   token: ***
2025-05-07T20:22:44.9881575Z   ssh-strict: true
2025-05-07T20:22:44.9881790Z   ssh-user: git
2025-05-07T20:22:44.9882021Z   persist-credentials: true
2025-05-07T20:22:44.9882280Z   clean: true
2025-05-07T20:22:44.9882513Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.9882785Z   fetch-depth: 1
2025-05-07T20:22:44.9883006Z   fetch-tags: false
2025-05-07T20:22:44.9883230Z   show-progress: true
2025-05-07T20:22:44.9883468Z   lfs: false
2025-05-07T20:22:44.9883708Z   set-safe-directory: true
2025-05-07T20:22:44.9883988Z env:
2025-05-07T20:22:44.9884198Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.9884497Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.9884740Z   BUILD_TARGET: genai
2025-05-07T20:22:44.9884964Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.9885229Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:44.9885483Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.9885725Z ##[endgroup]
2025-05-07T20:22:45.1024600Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:45.1025839Z ##[group]Getting Git version info
2025-05-07T20:22:45.1026297Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:45.1026910Z [command]/usr/bin/git version
2025-05-07T20:22:45.1027683Z git version 2.47.1
2025-05-07T20:22:45.1053413Z ##[endgroup]
2025-05-07T20:22:45.1075322Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0be5a3c3-c8ad-4d74-96f6-b84f970e55ff' before making global git config changes
2025-05-07T20:22:45.1076244Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:45.1080534Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:45.1116522Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:45.6219036Z ##[group]Initializing the repository
2025-05-07T20:22:45.6224664Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:45.6277425Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:45.6278109Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:45.6278660Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:45.6279044Z hint:
2025-05-07T20:22:45.6279320Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:45.6279656Z hint:
2025-05-07T20:22:45.6279960Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:45.6280490Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:45.6280923Z hint:
2025-05-07T20:22:45.6281127Z hint:   git branch -m <name>
2025-05-07T20:22:45.6281605Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:45.6290071Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.6324395Z ##[endgroup]
2025-05-07T20:22:45.6324825Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:45.6328444Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:45.6360455Z ##[endgroup]
2025-05-07T20:22:45.6360823Z ##[group]Setting up auth
2025-05-07T20:22:45.6366775Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:45.6398033Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:45.6747673Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:45.6780878Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:45.7128653Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.7185243Z ##[endgroup]
2025-05-07T20:22:45.7185647Z ##[group]Fetching the repository
2025-05-07T20:22:45.7192577Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:46.1357325Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:46.1357915Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:46.1384636Z ##[endgroup]
2025-05-07T20:22:46.1385073Z ##[group]Determining the checkout info
2025-05-07T20:22:46.1387670Z ##[endgroup]
2025-05-07T20:22:46.1393716Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:46.1441699Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:46.1480146Z ##[group]Checking out the ref
2025-05-07T20:22:46.1484624Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:46.2581367Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:46.2581755Z
2025-05-07T20:22:46.2582045Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:46.2582717Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:46.2583590Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:46.2583995Z
2025-05-07T20:22:46.2584263Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:46.2584885Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:46.2585250Z
2025-05-07T20:22:46.2585402Z   git switch -c <new-branch-name>
2025-05-07T20:22:46.2585656Z
2025-05-07T20:22:46.2585847Z Or undo this operation with:
2025-05-07T20:22:46.2586085Z
2025-05-07T20:22:46.2586200Z   git switch -
2025-05-07T20:22:46.2586687Z
2025-05-07T20:22:46.2587002Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:46.2587446Z
2025-05-07T20:22:46.2587975Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:46.2595521Z ##[endgroup]
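The checkout above boils down to a handful of git commands: initialize an empty repository, shallow-fetch only the PR's merge commit into a synthetic ref, then check that ref out detached. A minimal sketch of the same sequence, using the repo URL and merge SHA from the fetch line above (the auth header setup is omitted here):

  # Materialize a PR merge commit the way actions/checkout@v4 does (sketch).
  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  git config --local gc.auto 0   # the action disables auto-GC on the work copy
  # Shallow-fetch the single merge commit into a synthetic ref, then detach onto it.
  git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
      origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
  git checkout --force refs/remotes/pull/4066/merge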
2025-05-07T20:22:46.2595987Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:46.2600735Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:46.2650601Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:46.2686895Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:46.2722858Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:46.2754373Z ##[endgroup]
2025-05-07T20:22:46.2754754Z ##[group]Fetching submodules
2025-05-07T20:22:46.2757729Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:46.3108148Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:46.3443831Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:46.3445925Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:46.3449011Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:46.3452479Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:46.3456130Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:46.3459783Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:46.3463037Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:46.3493328Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:46.6930230Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:47.1863441Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:47.5917585Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.6604997Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.9482287Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:49.1911953Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:50.4146154Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:50.4146641Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:50.4627975Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:51.1458541Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:51.1459024Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.4210367Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:52.1150723Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:52.1151164Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:52.2143530Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:53.3465004Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:53.3465465Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:54.0497265Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.8381038Z From https://github.com/google/googletest
2025-05-07T20:22:54.8381497Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.8790984Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.4309527Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.4310032Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.4396031Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:56.1735801Z From https://github.com/nlohmann/json
2025-05-07T20:22:56.1736252Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.2872518Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.2891614Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.3232499Z Entering 'external/asmjit'
2025-05-07T20:22:56.3265387Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.3297224Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.3330127Z Entering 'external/cutlass'
2025-05-07T20:22:56.3361893Z Entering 'external/googletest'
2025-05-07T20:22:56.3393603Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.3426021Z Entering 'external/json'
2025-05-07T20:22:56.3470622Z ##[endgroup]
2025-05-07T20:22:56.3471182Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.3478034Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.3813870Z Entering 'external/asmjit'
2025-05-07T20:22:56.3878872Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.3951865Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.4019657Z Entering 'external/cutlass'
2025-05-07T20:22:56.4093413Z Entering 'external/googletest'
2025-05-07T20:22:56.4160277Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.4227072Z Entering 'external/json'
2025-05-07T20:22:56.4310654Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.4643585Z Entering 'external/asmjit'
2025-05-07T20:22:56.4705643Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.4708318Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.4771548Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.4774558Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.4838677Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.4841247Z Entering 'external/cutlass'
2025-05-07T20:22:56.4907597Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.4910373Z Entering 'external/googletest'
2025-05-07T20:22:56.4970833Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.4973615Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5034398Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.5037028Z Entering 'external/json'
2025-05-07T20:22:56.5096641Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.5189397Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.5519554Z Entering 'external/asmjit'
2025-05-07T20:22:56.5552512Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.5585299Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5617354Z Entering 'external/cutlass'
2025-05-07T20:22:56.5649067Z Entering 'external/googletest'
2025-05-07T20:22:56.5681140Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5715330Z Entering 'external/json'
2025-05-07T20:22:56.5764469Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.6098070Z Entering 'external/asmjit'
2025-05-07T20:22:56.6130425Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.6163267Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.6195602Z Entering 'external/cutlass'
2025-05-07T20:22:56.6228121Z Entering 'external/googletest'
2025-05-07T20:22:56.6260245Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.6292374Z Entering 'external/json'
2025-05-07T20:22:56.6357069Z ##[endgroup]
2025-05-07T20:22:56.6385810Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.6417123Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
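The "persisting credentials" block above works by git URL rewriting: SSH-style submodule remotes are rewritten to HTTPS so that the injected Authorization header (masked as *** in the log) covers every submodule fetch. A minimal sketch of the same idea for a single repository; B64_TOKEN is a placeholder, and the x-access-token encoding is the convention actions/checkout documents rather than something visible in this masked log:

  # Rewrite SSH-style remotes to HTTPS so the header below applies to them too.
  git config --local url.https://github.com/.insteadOf git@github.com:
  # B64_TOKEN would be base64("x-access-token:<GITHUB_TOKEN>") -- placeholder only.
  git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${B64_TOKEN}"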
2025-05-07T20:22:56.6599965Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.6600269Z with:
2025-05-07T20:22:56.6600503Z   name: fbgemm_genai_x86_clang_py3.12_cu12.8.0.whl
2025-05-07T20:22:56.6600826Z   merge-multiple: false
2025-05-07T20:22:56.6601072Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.6601324Z   run-id: 14891846252
2025-05-07T20:22:56.6601527Z env:
2025-05-07T20:22:56.6601739Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.6602032Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.6602282Z   BUILD_TARGET: genai
2025-05-07T20:22:56.6602499Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.6602732Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.6602982Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.6603220Z ##[endgroup]
2025-05-07T20:22:56.8889701Z Downloading single artifact
2025-05-07T20:22:56.9817692Z Preparing to download the following artifacts:
2025-05-07T20:22:56.9818552Z - fbgemm_genai_x86_clang_py3.12_cu12.8.0.whl (ID: 3081397670, Size: 18492313, Expected Digest: sha256:4144078f606f5674fd0d0827aa1139350c9dac781397a58fc2ce2aeb29225152)
2025-05-07T20:22:57.0508653Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-0953d042-3ee9-5e70-b1c3-e8d5865d7dd7/artifacts/59a2b3a78d1811f7aac902bf66636fb399e87795f56f74815a857bec15c93d16.zip
2025-05-07T20:22:57.0510118Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:57.1718207Z (node:66942) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:57.1719191Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.4545631Z SHA256 digest of downloaded artifact is 4144078f606f5674fd0d0827aa1139350c9dac781397a58fc2ce2aeb29225152
2025-05-07T20:22:57.4546262Z Artifact download completed successfully.
2025-05-07T20:22:57.4546592Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.4551701Z Download artifact has finished successfully
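download-artifact@v4 verifies the downloaded zip against the digest recorded at upload time, which is what the "Expected Digest" and "SHA256 digest of downloaded artifact" lines above show. The same check can be reproduced by hand; a minimal sketch, assuming the artifact has already been saved locally as artifact.zip:

  # Verify a downloaded artifact against its recorded SHA256 digest (sketch).
  EXPECTED=4144078f606f5674fd0d0827aa1139350c9dac781397a58fc2ce2aeb29225152
  ACTUAL=$(sha256sum artifact.zip | awk '{print $1}')
  [ "$ACTUAL" = "$EXPECTED" ] || { echo "digest mismatch: $ACTUAL" >&2; exit 1; }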
2025-05-07T20:22:57.4797201Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.4797600Z with:
2025-05-07T20:22:57.4797815Z   driver-version: 570.133.07
2025-05-07T20:22:57.4798055Z env:
2025-05-07T20:22:57.4798273Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.4798575Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.4798810Z   BUILD_TARGET: genai
2025-05-07T20:22:57.4799039Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.4799273Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.4799519Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.4799759Z ##[endgroup]
2025-05-07T20:22:57.4889688Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.4890096Z with:
2025-05-07T20:22:57.4890553Z   timeout_minutes: 10
2025-05-07T20:22:57.4890790Z   max_attempts: 3
2025-05-07T20:22:57.4915356Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more
          # than one of them so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Fail to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!               Off   | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |     ERR!     Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing piece of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:57.4939716Z   retry_wait_seconds: 10
2025-05-07T20:22:57.4939973Z   polling_interval_seconds: 1
2025-05-07T20:22:57.4940237Z   warning_on_retry: true
2025-05-07T20:22:57.4940487Z   continue_on_error: false
2025-05-07T20:22:57.4940729Z env:
2025-05-07T20:22:57.4940949Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.4941254Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.4941498Z   BUILD_TARGET: genai
2025-05-07T20:22:57.4941723Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.4941964Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.4942224Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.4942466Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.4942732Z ##[endgroup]
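nick-fields/retry runs the script above under the timeout and attempt settings printed in the step config. Roughly equivalent plain bash, as a sketch only (the action's real polling and backoff logic is more involved, and install_nvidia.sh is a hypothetical file holding the script above):

  # Approximate the retry step: 3 attempts, 10-minute timeout, 10 s between tries.
  for attempt in 1 2 3; do                       # max_attempts: 3
    timeout 10m bash install_nvidia.sh && break  # timeout_minutes: 10
    [ "$attempt" -eq 3 ] && exit 1               # continue_on_error: false
    sleep 10                                     # retry_wait_seconds: 10
  done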
2025-05-07T20:22:57.5637153Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.5638130Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.5640536Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.9061169Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.9062236Z No packages marked for removal.
2025-05-07T20:22:57.9125665Z Dependencies resolved.
2025-05-07T20:22:57.9135185Z Nothing to do.
2025-05-07T20:22:57.9135599Z Complete!
2025-05-07T20:22:57.9474108Z + install_nvidia_driver_common
2025-05-07T20:22:57.9478266Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.9492362Z + lspci
2025-05-07T20:22:57.9492690Z Before installing NVIDIA driver
2025-05-07T20:22:57.9600345Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.9601079Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.9601649Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.9602173Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.9602666Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.9603200Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.9603753Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.9604322Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.9604732Z + lsmod
2025-05-07T20:22:57.9649456Z Module Size Used by
2025-05-07T20:22:57.9649829Z xt_nat 16384 0
2025-05-07T20:22:57.9650210Z nvidia_modeset 1716224 0
2025-05-07T20:22:57.9650598Z video 65536 1 nvidia_modeset
2025-05-07T20:22:57.9651050Z wmi 36864 1 video
2025-05-07T20:22:57.9651352Z nvidia_uvm 1884160 0
2025-05-07T20:22:57.9651655Z nvidia 11583488 7 nvidia_uvm,nvidia_modeset
2025-05-07T20:22:57.9652103Z drm 602112 1 nvidia
2025-05-07T20:22:57.9652540Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:57.9653035Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:22:57.9653466Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:57.9653745Z veth 36864 0
2025-05-07T20:22:57.9654000Z xt_conntrack 16384 1
2025-05-07T20:22:57.9654257Z nft_chain_nat 16384 3
2025-05-07T20:22:57.9654515Z xt_MASQUERADE 20480 1
2025-05-07T20:22:57.9654837Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.9655186Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:57.9655613Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.9656082Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:57.9656403Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:57.9656701Z xfrm_user 57344 1
2025-05-07T20:22:57.9656961Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:57.9657253Z xt_addrtype 16384 2
2025-05-07T20:22:57.9657525Z nft_compat 20480 4
2025-05-07T20:22:57.9657829Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.9658250Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.9658624Z br_netfilter 36864 0
2025-05-07T20:22:57.9658897Z bridge 323584 1 br_netfilter
2025-05-07T20:22:57.9659197Z stp 16384 1 bridge
2025-05-07T20:22:57.9659492Z llc 16384 2 bridge,stp
2025-05-07T20:22:57.9659779Z overlay 167936 0
2025-05-07T20:22:57.9660037Z tls 135168 0
2025-05-07T20:22:57.9660294Z nls_ascii 16384 1
2025-05-07T20:22:57.9660552Z nls_cp437 20480 1
2025-05-07T20:22:57.9660798Z vfat 24576 1
2025-05-07T20:22:57.9661056Z fat 86016 1 vfat
2025-05-07T20:22:57.9661370Z sunrpc 696320 1
2025-05-07T20:22:57.9661690Z ena 180224 0
2025-05-07T20:22:57.9661942Z i8042 45056 0
2025-05-07T20:22:57.9662201Z serio 28672 3 i8042
2025-05-07T20:22:57.9662478Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:57.9662748Z button 24576 0
2025-05-07T20:22:57.9663010Z sch_fq_codel 20480 17
2025-05-07T20:22:57.9663267Z dm_mod 188416 0
2025-05-07T20:22:57.9663520Z fuse 163840 1
2025-05-07T20:22:57.9663779Z loop 36864 0
2025-05-07T20:22:57.9664028Z configfs 57344 1
2025-05-07T20:22:57.9664292Z dax 45056 1 dm_mod
2025-05-07T20:22:57.9664575Z dmi_sysfs 20480 0
2025-05-07T20:22:57.9665222Z crc32_pclmul 16384 0
2025-05-07T20:22:57.9665493Z crc32c_intel 24576 0
2025-05-07T20:22:57.9665755Z efivarfs 24576 1
2025-05-07T20:22:57.9666096Z + modinfo nvidia
2025-05-07T20:22:57.9668305Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.9668951Z import_ns: DMA_BUF
2025-05-07T20:22:57.9669293Z alias: char-major-195-*
2025-05-07T20:22:57.9669648Z version: 570.133.07
2025-05-07T20:22:57.9669906Z supported: external
2025-05-07T20:22:57.9670158Z license: Dual MIT/GPL
2025-05-07T20:22:57.9670495Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.9670987Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.9671631Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:57.9671972Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.9672350Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.9672719Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.9673041Z depends: i2c-core,drm
2025-05-07T20:22:57.9673307Z retpoline: Y
2025-05-07T20:22:57.9673539Z name: nvidia
2025-05-07T20:22:57.9673904Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.9674539Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.9675157Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.9675580Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.9675895Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:57.9676205Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.9676527Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:57.9676837Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:57.9677224Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:57.9677722Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.9678251Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.9678591Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.9678896Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:57.9679201Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.9679577Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.9679984Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.9680363Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.9680784Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.9681281Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.9681852Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.9682362Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.9682718Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.9683107Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.9683485Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.9683834Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.9684169Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.9684503Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.9684847Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.9685169Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:57.9685520Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.9685900Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.9686243Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:57.9686592Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.9686950Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.9687300Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:57.9687660Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.9689503Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:57.9689812Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.9690151Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.9690479Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.9690803Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.9691143Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.9691502Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.9691864Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:57.9692288Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.9692643Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.9692985Z parm: rm_firmware_active:charp
2025-05-07T20:22:57.9693394Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.9693651Z ++ command -v nvidia-smi
2025-05-07T20:22:57.9693921Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.9694187Z + set +e
2025-05-07T20:22:57.9694508Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:57.9908217Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:57.9908633Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:57.9908962Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:57.9909274Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:57.9909634Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:57.9910187Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:57.9910785Z + set -e
2025-05-07T20:22:57.9911054Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:57.9911578Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:57.9912147Z + post_install_nvidia_driver_common
2025-05-07T20:22:57.9915058Z + sudo modprobe nvidia
2025-05-07T20:22:58.1506539Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:58.1506972Z + lspci
2025-05-07T20:22:58.1507267Z After installing NVIDIA driver
2025-05-07T20:22:58.1622708Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:58.1623384Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:58.1624028Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:58.1624552Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:58.1625051Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:58.1625788Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:58.1626445Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:58.1626933Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:58.1627352Z + lsmod
2025-05-07T20:22:58.1655615Z Module Size Used by
2025-05-07T20:22:58.1656058Z xt_nat 16384 0
2025-05-07T20:22:58.1656450Z nvidia_modeset 1716224 0
2025-05-07T20:22:58.1656842Z video 65536 1 nvidia_modeset
2025-05-07T20:22:58.1657247Z wmi 36864 1 video
2025-05-07T20:22:58.1657524Z nvidia_uvm 1884160 0
2025-05-07T20:22:58.1657947Z nvidia 11583488 7 nvidia_uvm,nvidia_modeset
2025-05-07T20:22:58.1658392Z drm 602112 1 nvidia
2025-05-07T20:22:58.1658798Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:58.1659191Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:22:58.1659538Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:58.1659826Z veth 36864 0
2025-05-07T20:22:58.1660076Z xt_conntrack 16384 1
2025-05-07T20:22:58.1660333Z nft_chain_nat 16384 3
2025-05-07T20:22:58.1660603Z xt_MASQUERADE 20480 1
2025-05-07T20:22:58.1660910Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:58.1661259Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:58.1661915Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:58.1662383Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:58.1662702Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:58.1662999Z xfrm_user 57344 1
2025-05-07T20:22:58.1663269Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:58.1663557Z xt_addrtype 16384 2
2025-05-07T20:22:58.1663816Z nft_compat 20480 4
2025-05-07T20:22:58.1664114Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:58.1664527Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:58.1664910Z br_netfilter 36864 0
2025-05-07T20:22:58.1665189Z bridge 323584 1 br_netfilter
2025-05-07T20:22:58.1665620Z stp 16384 1 bridge
2025-05-07T20:22:58.1665906Z llc 16384 2 bridge,stp
2025-05-07T20:22:58.1666191Z overlay 167936 0
2025-05-07T20:22:58.1666441Z tls 135168 0
2025-05-07T20:22:58.1666692Z nls_ascii 16384 1
2025-05-07T20:22:58.1666946Z nls_cp437 20480 1
2025-05-07T20:22:58.1667186Z vfat 24576 1
2025-05-07T20:22:58.1667435Z fat 86016 1 vfat
2025-05-07T20:22:58.1667700Z sunrpc 696320 1
2025-05-07T20:22:58.1667983Z ena 180224 0
2025-05-07T20:22:58.1668226Z i8042 45056 0
2025-05-07T20:22:58.1668472Z serio 28672 3 i8042
2025-05-07T20:22:58.1668748Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:58.1669011Z button 24576 0
2025-05-07T20:22:58.1669262Z sch_fq_codel 20480 17
2025-05-07T20:22:58.1669519Z dm_mod 188416 0
2025-05-07T20:22:58.1669778Z fuse 163840 1
2025-05-07T20:22:58.1670017Z loop 36864 0
2025-05-07T20:22:58.1670267Z configfs 57344 1
2025-05-07T20:22:58.1670522Z dax 45056 1 dm_mod
2025-05-07T20:22:58.1670799Z dmi_sysfs 20480 0
2025-05-07T20:22:58.1671046Z crc32_pclmul 16384 0
2025-05-07T20:22:58.1671301Z crc32c_intel 24576 0
2025-05-07T20:22:58.1671551Z efivarfs 24576 1
2025-05-07T20:22:58.1671798Z + modinfo nvidia
2025-05-07T20:22:58.1672854Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:58.1673480Z import_ns: DMA_BUF
2025-05-07T20:22:58.1673810Z alias: char-major-195-*
2025-05-07T20:22:58.1674196Z version: 570.133.07
2025-05-07T20:22:58.1674477Z supported: external
2025-05-07T20:22:58.1674729Z license: Dual MIT/GPL
2025-05-07T20:22:58.1675013Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:58.1675366Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:58.1675687Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:58.1675999Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:58.1676342Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:58.1676682Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:58.1677002Z depends: i2c-core,drm
2025-05-07T20:22:58.1677253Z retpoline: Y
2025-05-07T20:22:58.1677473Z name: nvidia
2025-05-07T20:22:58.1677899Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:58.1678544Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:58.1679148Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:58.1679615Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:58.1679919Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:58.1680221Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:58.1680543Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:58.1680841Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:58.1681150Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:58.1681637Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:58.1682033Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:58.1682361Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:58.1682664Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:58.1682969Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:58.1683326Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:58.1683727Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:58.1684106Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:58.1684515Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.1684927Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:58.1685498Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.1685912Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:58.1686249Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:58.1686618Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:58.1686992Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:58.1687324Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:58.1687648Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:58.1687980Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:58.1688299Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:58.1688610Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:58.1688957Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:58.1689316Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:58.1689641Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:58.1689985Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:58.1690334Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:58.1690665Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:58.1691015Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:58.1691348Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:58.1691631Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:58.1692022Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:58.1692353Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:58.1692664Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:58.1693000Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:58.1693358Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:58.1693710Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:58.1694031Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:58.1694379Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:58.1694726Z parm: rm_firmware_active:charp
2025-05-07T20:22:58.1695006Z + set +e
2025-05-07T20:22:58.1695200Z + nvidia-smi
2025-05-07T20:22:58.1851428Z Wed May  7 20:22:58 2025
2025-05-07T20:22:58.1852006Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:22:58.1852711Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:22:58.1853294Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:22:58.1853795Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:22:58.1854346Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:22:58.1854927Z |                                         |                        |               MIG M. |
2025-05-07T20:22:58.1855270Z |=========================================+========================+======================|
2025-05-07T20:22:58.1990022Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:22:58.1990866Z |  0%   28C    P8             10W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:22:58.1991399Z |                                         |                        |                  N/A |
2025-05-07T20:22:58.1991809Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:22:58.1994752Z
2025-05-07T20:22:58.1995331Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:22:58.1995917Z | Processes:                                                                              |
2025-05-07T20:22:58.1996363Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:22:58.1996953Z |        ID   ID                                                               Usage      |
2025-05-07T20:22:58.1997306Z |=========================================================================================|
2025-05-07T20:22:58.1999705Z |  No running processes found                                                             |
2025-05-07T20:22:58.2000366Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:22:58.4643220Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:22:58.4808992Z NVIDIA A10G
2025-05-07T20:22:58.4850053Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:58.4851022Z + '[' 0 -eq 0 ']'
2025-05-07T20:22:58.4851321Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:22:58.4851612Z + set -e
2025-05-07T20:22:58.4851819Z INFO: Ignoring allowed status 0
2025-05-07T20:22:58.4859346Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:22:58.4863197Z + sudo yum install -y yum-utils
2025-05-07T20:22:58.9122622Z Last metadata expiration check: 0:09:03 ago on Wed May  7 20:13:55 2025.
2025-05-07T20:22:58.9366897Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:22:58.9762175Z Dependencies resolved.
2025-05-07T20:22:58.9943895Z Nothing to do.
2025-05-07T20:22:58.9945011Z Complete!
2025-05-07T20:22:59.0333721Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:22:59.0334330Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:22:59.0335204Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:22:59.3495180Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:22:59.4097089Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:22:59.9287126Z nvidia-container-toolkit 14 kB/s | 833 B 00:00
2025-05-07T20:22:59.9532922Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:22:59.9538351Z Package nvidia-container-toolkit-1.16.2-1.x86_64 is already installed.
2025-05-07T20:22:59.9927614Z Dependencies resolved.
2025-05-07T20:23:00.0111358Z Nothing to do.
2025-05-07T20:23:00.0112278Z Complete!
2025-05-07T20:23:00.0499567Z + sudo systemctl restart docker
2025-05-07T20:23:03.5574469Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:03.5774103Z Wed May  7 20:23:03 2025
2025-05-07T20:23:03.5774796Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:03.5775401Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:03.5775894Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:03.5776390Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:03.5776968Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:03.5777410Z |                                         |                        |               MIG M. |
2025-05-07T20:23:03.5778037Z |=========================================+========================+======================|
2025-05-07T20:23:03.5910620Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:03.5911067Z |  0%   28C    P8             11W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:03.5911460Z |                                         |                        |                  N/A |
2025-05-07T20:23:03.5911865Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:03.5915664Z
2025-05-07T20:23:03.5916084Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:03.5916756Z | Processes:                                                                              |
2025-05-07T20:23:03.5917216Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:03.5917639Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:03.5917989Z |=========================================================================================|
2025-05-07T20:23:03.5920729Z |  No running processes found                                                             |
2025-05-07T20:23:03.5921213Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:04.5447750Z Command completed after 1 attempt(s).
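The GPU_FLAG written to GITHUB_ENV by the script above is meant to be spliced into later docker invocations. A minimal smoke test of those exact flags; the CUDA image tag here is an illustrative choice, not one taken from this log:

  # Confirm containers can see the GPU with the flags the job exported (sketch).
  docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all \
      nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi   # should print the A10G table seen above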
2025-05-07T20:23:04.5550179Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:04.5550659Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:04.5564183Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:04.5564566Z env:
2025-05-07T20:23:04.5564807Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:04.5565103Z   BUILD_ENV: build_binary
2025-05-07T20:23:04.5565346Z   BUILD_TARGET: genai
2025-05-07T20:23:04.5565566Z   BUILD_VARIANT: cuda
2025-05-07T20:23:04.5565792Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:04.5566044Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:04.5566342Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:04.5566668Z ##[endgroup]
2025-05-07T20:23:04.8950170Z ################################################################################
2025-05-07T20:23:04.8950515Z # Print System Info
2025-05-07T20:23:04.8950738Z #
2025-05-07T20:23:04.8967073Z # [2025-05-07T20:23:04.896Z] + print_system_info
2025-05-07T20:23:04.8967432Z ################################################################################
2025-05-07T20:23:04.8967651Z
2025-05-07T20:23:04.8967773Z ################################################################################
2025-05-07T20:23:04.8968100Z [INFO] Printing environment variables ...
2025-05-07T20:23:04.8968394Z + printenv
2025-05-07T20:23:04.8968511Z
2025-05-07T20:23:04.8994483Z SHELL=/bin/bash
2025-05-07T20:23:04.8995448Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:04.8996023Z BUILD_VARIANT=cuda
2025-05-07T20:23:04.8996749Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.8997364Z GITHUB_ACTION=__run
2025-05-07T20:23:04.8997652Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:04.8997998Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:04.8998260Z RUNNER_NAME=i-0a11e2b4e0c9387f6
2025-05-07T20:23:04.8998567Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:04.8998872Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:04.8999148Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:04.8999526Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:04.8999981Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:04.9000258Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:04.9000553Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:04.9001487Z ***
2025-05-07T20:23:04.9001681Z LOGNAME=ec2-user
2025-05-07T20:23:04.9001937Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:04.9002200Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:04.9002430Z GITHUB_ACTIONS=true
2025-05-07T20:23:04.9002647Z SYSTEMD_EXEC_PID=55524
2025-05-07T20:23:04.9002929Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:04.9003481Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:04.9004002Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:04.9004288Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:04.9004547Z RUNNER_OS=Linux
2025-05-07T20:23:04.9004764Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:04.9005009Z HOME=/home/ec2-user
2025-05-07T20:23:04.9005595Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:04.9005886Z LANG=C.UTF-8
2025-05-07T20:23:04.9006481Z RUNNER_TRACKING_ID=github_3ac5b347-36c5-4052-8a78-c74e0ef3d3fd
2025-05-07T20:23:04.9006968Z RUNNER_ARCH=X64
2025-05-07T20:23:04.9007250Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:04.9007583Z BUILD_TARGET=genai
2025-05-07T20:23:04.9008125Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9009023Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9009778Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:04.9010492Z INVOCATION_ID=4ac64003978b4062acf61afbbb55318a
2025-05-07T20:23:04.9010833Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:04.9011105Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:04.9011694Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9012413Z BUILD_ENV=build_binary
2025-05-07T20:23:04.9012650Z GITHUB_ACTOR=q10
2025-05-07T20:23:04.9012868Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:04.9013099Z KERN_NAME_LC=linux
2025-05-07T20:23:04.9013326Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:04.9013626Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:04.9013975Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:04.9014226Z USER=ec2-user
2025-05-07T20:23:04.9014454Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:04.9014759Z SHLVL=1
2025-05-07T20:23:04.9014980Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:04.9015288Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:04.9015739Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:04.9016108Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:04.9016349Z KERN_NAME=Linux
2025-05-07T20:23:04.9016575Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:04.9016996Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:04.9017440Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:04.9017711Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:04.9018003Z JOURNAL_STREAM=8:90754
2025-05-07T20:23:04.9018322Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:04.9018689Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:04.9019004Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:04.9019341Z GITHUB_BASE_REF=main
2025-05-07T20:23:04.9019559Z CI=true
2025-05-07T20:23:04.9019771Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:04.9020059Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:04.9020340Z GITHUB_ACTION_REF=
2025-05-07T20:23:04.9020595Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:04.9021220Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_4719fd55-4f3e-4a7e-8f0c-08077e88b123
2025-05-07T20:23:04.9021824Z MACHINE_NAME=x86_64
2025-05-07T20:23:04.9022049Z _=/usr/bin/printenv
2025-05-07T20:23:04.9022191Z
2025-05-07T20:23:04.9022311Z ################################################################################
2025-05-07T20:23:04.9022637Z [INFO] Print ldd version ...
2025-05-07T20:23:04.9022894Z + ldd --version
2025-05-07T20:23:04.9023029Z
2025-05-07T20:23:04.9023134Z ldd (GNU libc) 2.34
2025-05-07T20:23:04.9023408Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:04.9023872Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:04.9024418Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:04.9024883Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:04.9025108Z
2025-05-07T20:23:04.9025235Z ################################################################################
2025-05-07T20:23:04.9025549Z [INFO] Print CPU info ...
2025-05-07T20:23:04.9025794Z + nproc 2025-05-07T20:23:04.9026058Z 2025-05-07T20:23:04.9043860Z 16 2025-05-07T20:23:04.9045796Z 2025-05-07T20:23:04.9046004Z + lscpu 2025-05-07T20:23:04.9046135Z 2025-05-07T20:23:04.9161175Z Architecture: x86_64 2025-05-07T20:23:04.9161573Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:04.9162100Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:04.9162648Z Byte Order: Little Endian 2025-05-07T20:23:04.9163104Z CPU(s): 16 2025-05-07T20:23:04.9163525Z On-line CPU(s) list: 0-15 2025-05-07T20:23:04.9163988Z Vendor ID: AuthenticAMD 2025-05-07T20:23:04.9164460Z Model name: AMD EPYC 7R32 2025-05-07T20:23:04.9164881Z CPU family: 23 2025-05-07T20:23:04.9165513Z Model: 49 2025-05-07T20:23:04.9165947Z Thread(s) per core: 2 2025-05-07T20:23:04.9166357Z Core(s) per socket: 8 2025-05-07T20:23:04.9166759Z Socket(s): 1 2025-05-07T20:23:04.9167064Z Stepping: 0 2025-05-07T20:23:04.9167367Z BogoMIPS: 5600.00 2025-05-07T20:23:04.9169545Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:04.9171716Z Hypervisor vendor: KVM 2025-05-07T20:23:04.9172111Z Virtualization type: full 2025-05-07T20:23:04.9172457Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:04.9172822Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:04.9173180Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:04.9173532Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:04.9173859Z NUMA node(s): 1 2025-05-07T20:23:04.9174159Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:04.9174548Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:04.9174929Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:04.9175294Z Vulnerability L1tf: Not affected 2025-05-07T20:23:04.9175650Z Vulnerability Mds: Not affected 2025-05-07T20:23:04.9176018Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:04.9176387Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:04.9176758Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:04.9177389Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:04.9178126Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:04.9178686Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:04.9179378Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:04.9180403Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:04.9181416Z Vulnerability Srbds: Not affected 2025-05-07T20:23:04.9181978Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:04.9182317Z 2025-05-07T20:23:04.9182448Z + cat /proc/cpuinfo 2025-05-07T20:23:04.9182645Z 2025-05-07T20:23:04.9182854Z processor : 0 2025-05-07T20:23:04.9183385Z vendor_id : AuthenticAMD 2025-05-07T20:23:04.9183703Z cpu family : 23 2025-05-07T20:23:04.9183993Z model : 49 
2025-05-07T20:23:04.9184285Z model name : AMD EPYC 7R32 2025-05-07T20:23:04.9184617Z stepping : 0 2025-05-07T20:23:04.9184907Z microcode : 0x830107f 2025-05-07T20:23:04.9185183Z cpu MHz : 3295.310 2025-05-07T20:23:04.9185391Z cache size : 512 KB 2025-05-07T20:23:04.9185608Z physical id : 0 2025-05-07T20:23:04.9185816Z siblings : 16 2025-05-07T20:23:04.9186021Z core id : 0 2025-05-07T20:23:04.9186215Z cpu cores : 8 2025-05-07T20:23:04.9186415Z apicid : 0 2025-05-07T20:23:04.9186614Z initial apicid : 0 2025-05-07T20:23:04.9186819Z fpu : yes 2025-05-07T20:23:04.9187016Z fpu_exception : yes 2025-05-07T20:23:04.9187235Z cpuid level : 13 2025-05-07T20:23:04.9187436Z wp : yes 2025-05-07T20:23:04.9189617Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:04.9191958Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:04.9192457Z bogomips : 5600.00 2025-05-07T20:23:04.9192674Z TLB size : 3072 4K pages 2025-05-07T20:23:04.9192914Z clflush size : 64 2025-05-07T20:23:04.9193132Z cache_alignment : 64 2025-05-07T20:23:04.9193400Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:04.9193729Z power management: 2025-05-07T20:23:04.9193867Z
[... /proc/cpuinfo records for processors 1-15 elided; identical to processor 0 apart from the processor, core id, apicid, and momentary cpu MHz fields ...]
2025-05-07T20:23:04.9381771Z ################################################################################ 2025-05-07T20:23:04.9382119Z [INFO] Print PCI info ... 2025-05-07T20:23:04.9382377Z + lspci -v 2025-05-07T20:23:04.9382500Z 2025-05-07T20:23:04.9382742Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:04.9383164Z Subsystem: Amazon.com, Inc.
Device 1237 2025-05-07T20:23:04.9383518Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:04.9383745Z 2025-05-07T20:23:04.9383962Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:04.9384382Z Physical Slot: 1 2025-05-07T20:23:04.9384629Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9384857Z 2025-05-07T20:23:04.9385133Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:04.9385617Z Physical Slot: 1 2025-05-07T20:23:04.9385880Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:04.9386131Z 2025-05-07T20:23:04.9386426Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:04.9386924Z Physical Slot: 3 2025-05-07T20:23:04.9387168Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9387645Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:04.9388036Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:04.9388283Z 2025-05-07T20:23:04.9388626Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:04.9389230Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:04.9389540Z Physical Slot: 4 2025-05-07T20:23:04.9389809Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:04.9390215Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:04.9390595Z Capabilities: 2025-05-07T20:23:04.9390878Z Kernel driver in use: nvme 2025-05-07T20:23:04.9391050Z 2025-05-07T20:23:04.9391386Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:04.9391912Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:04.9392297Z Physical Slot: 5 2025-05-07T20:23:04.9392557Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9392936Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:04.9393357Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:04.9393709Z Capabilities: 2025-05-07T20:23:04.9393985Z Kernel driver in use: ena 2025-05-07T20:23:04.9394238Z Kernel modules: ena 2025-05-07T20:23:04.9394383Z 2025-05-07T20:23:04.9394571Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:04.9395033Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:04.9395344Z Physical Slot: 30 2025-05-07T20:23:04.9395619Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:04.9396029Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:04.9396451Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:04.9396857Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:04.9397218Z Capabilities: 2025-05-07T20:23:04.9397503Z Kernel driver in use: nvidia 2025-05-07T20:23:04.9397782Z Kernel modules: nvidia 2025-05-07T20:23:04.9397935Z 2025-05-07T20:23:04.9398287Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:04.9398860Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:04.9399165Z Physical Slot: 31 2025-05-07T20:23:04.9399421Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:04.9399812Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:04.9400221Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:04.9400575Z Capabilities: 2025-05-07T20:23:04.9400861Z Kernel driver in use: nvme 2025-05-07T20:23:04.9401031Z 2025-05-07T20:23:04.9401035Z 2025-05-07T20:23:04.9401153Z ################################################################################ 2025-05-07T20:23:04.9401501Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:04.9401810Z + uname -a 2025-05-07T20:23:04.9401928Z 2025-05-07T20:23:04.9402381Z Linux ip-10-0-64-8.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:04.9402946Z 2025-05-07T20:23:04.9403027Z + uname -m 2025-05-07T20:23:04.9403152Z 2025-05-07T20:23:04.9403226Z x86_64 2025-05-07T20:23:04.9403335Z 2025-05-07T20:23:04.9403429Z + cat /proc/version 2025-05-07T20:23:04.9403566Z 2025-05-07T20:23:04.9404182Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:04.9404949Z 2025-05-07T20:23:04.9405038Z + cat /etc/os-release 2025-05-07T20:23:04.9405196Z 2025-05-07T20:23:04.9405292Z NAME="Amazon Linux" 2025-05-07T20:23:04.9405513Z VERSION="2023" 2025-05-07T20:23:04.9405717Z ID="amzn" 2025-05-07T20:23:04.9405913Z ID_LIKE="fedora" 2025-05-07T20:23:04.9406450Z VERSION_ID="2023" 2025-05-07T20:23:04.9406688Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:04.9406987Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:04.9407291Z ANSI_COLOR="0;33" 2025-05-07T20:23:04.9407550Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:04.9407971Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:04.9408446Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:04.9408898Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:04.9409378Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:04.9409796Z VENDOR_NAME="AWS" 2025-05-07T20:23:04.9410046Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:04.9410351Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:04.9410518Z 2025-05-07T20:23:04.9410784Z ################################################################################ 2025-05-07T20:23:04.9411119Z # Print EC2 Instance Info 2025-05-07T20:23:04.9411362Z # 2025-05-07T20:23:04.9411585Z # [2025-05-07T20:23:04.938Z] + print_ec2_info 2025-05-07T20:23:04.9411916Z ################################################################################ 2025-05-07T20:23:04.9412211Z 2025-05-07T20:23:04.9509537Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:04.9627864Z instance-id: i-0a11e2b4e0c9387f6 2025-05-07T20:23:04.9740546Z instance-type: g5.4xlarge 2025-05-07T20:23:04.9786203Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:04.9786573Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:04.9796419Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:04.9796778Z env: 2025-05-07T20:23:04.9797008Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:04.9797312Z BUILD_ENV: build_binary 2025-05-07T20:23:04.9797563Z BUILD_TARGET: genai 2025-05-07T20:23:04.9797795Z BUILD_VARIANT: cuda 2025-05-07T20:23:04.9798027Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:04.9798287Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:04.9798591Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:04.9798936Z ##[endgroup] 2025-05-07T20:23:05.3170053Z ################################################################################ 2025-05-07T20:23:05.3170419Z [INFO] Printing general display info ... 2025-05-07T20:23:05.3186919Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:05.4387681Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:05.4398138Z /usr/bin/sudo 2025-05-07T20:23:05.4408942Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:05.4418829Z /usr/bin/yum 2025-05-07T20:23:05.4420530Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:05.4442526Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:05.8658866Z Last metadata expiration check: 0:00:06 ago on Wed May 7 20:22:59 2025. 2025-05-07T20:23:05.9479767Z ================================================================================ 2025-05-07T20:23:05.9480443Z WARNING: 2025-05-07T20:23:05.9480953Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:05.9481411Z 2025-05-07T20:23:05.9481606Z Available Versions: 2025-05-07T20:23:05.9481900Z 2025-05-07T20:23:05.9482082Z Version 2023.7.20250331: 2025-05-07T20:23:05.9482712Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:05.9483226Z 2025-05-07T20:23:05.9483497Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:05.9483922Z 2025-05-07T20:23:05.9484110Z Release notes: 2025-05-07T20:23:05.9484845Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:05.9485281Z 2025-05-07T20:23:05.9485375Z Version 2023.7.20250414: 2025-05-07T20:23:05.9485698Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:05.9485957Z 2025-05-07T20:23:05.9486078Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:05.9486299Z 2025-05-07T20:23:05.9486389Z Release notes: 2025-05-07T20:23:05.9487004Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:05.9487380Z 2025-05-07T20:23:05.9487481Z Version 2023.7.20250428: 2025-05-07T20:23:05.9487790Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:05.9488054Z 2025-05-07T20:23:05.9488175Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:05.9488394Z 2025-05-07T20:23:05.9488492Z Release notes: 2025-05-07T20:23:05.9488891Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:05.9489272Z 2025-05-07T20:23:05.9489390Z ================================================================================ 2025-05-07T20:23:06.0633004Z Dependencies resolved. 
2025-05-07T20:23:06.0918193Z ================================================================================ 2025-05-07T20:23:06.0918625Z Package Arch Version Repository Size 2025-05-07T20:23:06.0919012Z ================================================================================ 2025-05-07T20:23:06.0919337Z Upgrading: 2025-05-07T20:23:06.0919707Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:06.0920306Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:06.0920679Z 2025-05-07T20:23:06.0921000Z Transaction Summary 2025-05-07T20:23:06.0921268Z ================================================================================ 2025-05-07T20:23:06.0921594Z Upgrade 2 Packages 2025-05-07T20:23:06.0921732Z 2025-05-07T20:23:06.0921833Z Total download size: 6.9 M 2025-05-07T20:23:06.0922772Z Downloading Packages: 2025-05-07T20:23:06.1332617Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 31 MB/s | 1.2 MB 00:00 2025-05-07T20:23:06.1799323Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 66 MB/s | 5.7 MB 00:00 2025-05-07T20:23:06.1809079Z -------------------------------------------------------------------------------- 2025-05-07T20:23:06.1812060Z Total 78 MB/s | 6.9 MB 00:00 2025-05-07T20:23:06.1814464Z Running transaction check 2025-05-07T20:23:06.1909720Z Transaction check succeeded. 2025-05-07T20:23:06.1910358Z Running transaction test 2025-05-07T20:23:06.2204989Z Transaction test succeeded. 2025-05-07T20:23:06.2208252Z Running transaction 2025-05-07T20:23:06.7722200Z Preparing : 1/1 2025-05-07T20:23:06.8778220Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:06.8801631Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:06.9050753Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:06.9051573Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:06.9154075Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:06.9180717Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:07.0994786Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:07.0995395Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:07.0995961Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:07.0996508Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:07.2472421Z ================================================================================ 2025-05-07T20:23:07.2472797Z WARNING: 2025-05-07T20:23:07.2473047Z A newer release of "Amazon Linux" is available. 
[... remainder of the "newer release available" notice elided; identical to the upgrade notice printed above ...] 2025-05-07T20:23:07.3055857Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:07.3056234Z 2025-05-07T20:23:07.3056324Z Upgraded: 2025-05-07T20:23:07.3056690Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:07.3057290Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:07.3057644Z 2025-05-07T20:23:07.3057742Z Complete! 2025-05-07T20:23:07.3491730Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:07.3515802Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:07.7481990Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:22:59 2025. 2025-05-07T20:23:07.7723679Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:07.8122920Z Dependencies resolved.
2025-05-07T20:23:07.8300100Z ================================================================================ 2025-05-07T20:23:07.8301041Z Package Architecture Version Repository Size 2025-05-07T20:23:07.8301887Z ================================================================================ 2025-05-07T20:23:07.8302477Z Installing: 2025-05-07T20:23:07.8303046Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:07.8303601Z 2025-05-07T20:23:07.8303777Z Transaction Summary 2025-05-07T20:23:07.8304259Z ================================================================================ 2025-05-07T20:23:07.8304847Z Install 1 Package 2025-05-07T20:23:07.8305144Z 2025-05-07T20:23:07.8305266Z Total download size: 319 k 2025-05-07T20:23:07.8305518Z Installed size: 837 k 2025-05-07T20:23:07.8305756Z Downloading Packages: 2025-05-07T20:23:07.9777192Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 2.6 MB/s | 319 kB 00:00 2025-05-07T20:23:07.9782969Z -------------------------------------------------------------------------------- 2025-05-07T20:23:07.9785681Z Total 2.1 MB/s | 319 kB 00:00 2025-05-07T20:23:07.9942933Z Running transaction check 2025-05-07T20:23:07.9997863Z Transaction check succeeded. 2025-05-07T20:23:07.9998246Z Running transaction test 2025-05-07T20:23:08.0447319Z Transaction test succeeded. 2025-05-07T20:23:08.0451050Z Running transaction 2025-05-07T20:23:08.1463798Z Preparing : 1/1 2025-05-07T20:23:08.1969137Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:08.4019228Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:08.5292955Z ================================================================================ 2025-05-07T20:23:08.5293544Z WARNING: 2025-05-07T20:23:08.5293927Z A newer release of "Amazon Linux" is available. 
[... remainder of the "newer release available" notice elided; identical to the upgrade notice printed above ...] 2025-05-07T20:23:08.5643342Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:08.5643873Z 2025-05-07T20:23:08.5644006Z Installed: 2025-05-07T20:23:08.5644429Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:08.5644861Z 2025-05-07T20:23:08.5644985Z Complete! 2025-05-07T20:23:08.6101673Z + hostname 2025-05-07T20:23:08.6101824Z 2025-05-07T20:23:08.6116272Z ip-10-0-64-8.ec2.internal 2025-05-07T20:23:08.6118118Z 2025-05-07T20:23:08.6118483Z + sudo lshw -C display 2025-05-07T20:23:08.6118653Z 2025-05-07T20:23:09.2732061Z *-display:0 UNCLAIMED 2025-05-07T20:23:09.2732399Z description: VGA compatible controller 2025-05-07T20:23:09.2732729Z product: Amazon.com, Inc. 2025-05-07T20:23:09.2733015Z vendor: Amazon.com, Inc.
2025-05-07T20:23:09.2733282Z physical id: 3 2025-05-07T20:23:09.2733524Z bus info: pci@0000:00:03.0 2025-05-07T20:23:09.2733781Z version: 00 2025-05-07T20:23:09.2733998Z width: 32 bits 2025-05-07T20:23:09.2734225Z clock: 33MHz 2025-05-07T20:23:09.2734501Z capabilities: vga_controller bus_master 2025-05-07T20:23:09.2734821Z configuration: latency=0 2025-05-07T20:23:09.2735154Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:09.2735488Z *-display:1 2025-05-07T20:23:09.2735715Z description: 3D controller 2025-05-07T20:23:09.2735994Z product: GA102GL [A10G] 2025-05-07T20:23:09.2736269Z vendor: NVIDIA Corporation 2025-05-07T20:23:09.2736540Z physical id: 1e 2025-05-07T20:23:09.2736781Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:09.2737031Z version: a1 2025-05-07T20:23:09.2737245Z width: 64 bits 2025-05-07T20:23:09.2737469Z clock: 33MHz 2025-05-07T20:23:09.2737756Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:09.2738140Z configuration: driver=nvidia latency=0 2025-05-07T20:23:09.2738777Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:09.2770555Z 2025-05-07T20:23:09.2770770Z ################################################################################ 2025-05-07T20:23:09.2771095Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:09.2900485Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:09.3084096Z Wed May 7 20:23:09 2025 2025-05-07T20:23:09.3084495Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:09.3085165Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:09.3085664Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:09.3086209Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:09.3086744Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:09.3087174Z | | | MIG M. | 2025-05-07T20:23:09.3087518Z |=========================================+========================+======================| 2025-05-07T20:23:09.3219168Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:09.3220000Z | 0% 28C P8 10W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:09.3220397Z | | | N/A | 2025-05-07T20:23:09.3220802Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:09.3223978Z 2025-05-07T20:23:09.3224564Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:09.3225096Z | Processes: | 2025-05-07T20:23:09.3225554Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:09.3225976Z | ID ID Usage | 2025-05-07T20:23:09.3226334Z |=========================================================================================| 2025-05-07T20:23:09.3229135Z | No running processes found | 2025-05-07T20:23:09.3229656Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:09.5902316Z ################################################################################ 2025-05-07T20:23:09.5902760Z [INFO] Printing AMD GPU info ... 
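The [CHECK] lines that follow come from probing for the ROCm userland tools, which are expectedly absent on this CUDA machine; only the NVIDIA stack above is present. A rough sketch of such a which-based probe (the actual logic lives in .github/scripts/setup_env.bash):

  # Hedged sketch: report each ROCm tool as found (and run it) or missing.
  for tool in rocminfo rocm-smi; do
    if which "$tool"; then
      "$tool"
    else
      echo "[CHECK] $tool not found"
    fi
  done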
2025-05-07T20:23:09.6044704Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:09.6045792Z [CHECK] rocminfo not found 2025-05-07T20:23:09.6055195Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:09.6056490Z [CHECK] rocm-smi not found 2025-05-07T20:23:09.6089318Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:09.6089752Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:09.6100842Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:09.6101195Z env: 2025-05-07T20:23:09.6101417Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:09.6101728Z BUILD_ENV: build_binary 2025-05-07T20:23:09.6101977Z BUILD_TARGET: genai 2025-05-07T20:23:09.6102209Z BUILD_VARIANT: cuda 2025-05-07T20:23:09.6102451Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:09.6102712Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:09.6103021Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:09.6103352Z ##[endgroup] 2025-05-07T20:23:09.9464096Z ################################################################################ 2025-05-07T20:23:09.9464443Z # Setup Miniconda 2025-05-07T20:23:09.9464947Z # 2025-05-07T20:23:09.9479148Z # [2025-05-07T20:23:09.947Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:09.9479553Z ################################################################################ 2025-05-07T20:23:09.9479787Z 2025-05-07T20:23:09.9493937Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:10.0478620Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:10.0479002Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:10.0479203Z 2025-05-07T20:23:10.0498732Z 2025-05-07T20:23:10.0499221Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:10.0520774Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:11.4728609Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:11.4729162Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:11.4729480Z 2025-05-07T20:23:11.4873204Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:11.9324091Z Unpacking payload ... 2025-05-07T20:23:12.4512043Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:13.2518811Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:15.3522105Z 2025-05-07T20:23:15.3522627Z Installing base environment... 2025-05-07T20:23:15.3522850Z 2025-05-07T20:23:16.4194472Z Preparing transaction: ...working... done 2025-05-07T20:23:19.2928331Z Executing transaction: ...working... done 2025-05-07T20:23:19.9506922Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.0391164Z installation finished. 2025-05-07T20:23:20.0397931Z 2025-05-07T20:23:20.0398391Z + rm -f miniconda.sh 2025-05-07T20:23:20.0398598Z 2025-05-07T20:23:20.0714519Z 2025-05-07T20:23:20.0714888Z [SETUP] Reloading the bash configuration ... 
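The reload that follows is the standard way to make a fresh Miniconda usable inside the same non-interactive shell: `conda init bash` writes the activation hook into ~/.bashrc, and sourcing that file picks the hook up immediately. A minimal sketch of the non-interactive bootstrap this section performs (the prefix $HOME/miniconda matches the log; the rest is standard installer usage, not copied from setup_env.bash):

    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u   # -b: batch mode (no prompts), -p: install prefix, -u: update an existing install
    rm -f miniconda.sh
    "$HOME/miniconda/bin/conda" init bash          # appends the conda shell hook to ~/.bashrc
    . ~/.bashrc                                    # reload so `conda` resolves in this same shell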
2025-05-07T20:23:20.0715250Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:20.4362575Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:20.4362965Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:20.4363314Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:20.4363671Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:20.4364028Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:20.4364423Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:20.4364851Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:20.4365295Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:20.4365771Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:20.4366537Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:20.4367075Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:20.4367442Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:20.4367843Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:20.5020536Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:21.3303816Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:21.3327224Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:34.4748981Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:36.0540522Z Solving environment: done
2025-05-07T20:23:36.1498085Z ## Package Plan ##
2025-05-07T20:23:36.1498979Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:36.1499344Z   added / updated specs:
2025-05-07T20:23:36.1499624Z     - conda-libmamba-solver
2025-05-07T20:23:36.1499893Z     - libarchive
2025-05-07T20:23:36.1500109Z     - libmamba
2025-05-07T20:23:36.1500327Z     - libmambapy
2025-05-07T20:23:36.1500627Z The following packages will be downloaded:
2025-05-07T20:23:36.1500978Z     package                      |            build
2025-05-07T20:23:36.1501299Z     ---------------------------|-----------------
2025-05-07T20:23:36.1501723Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:36.1502217Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:36.1502669Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:36.1503158Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:36.1503616Z     ------------------------------------------------------------
2025-05-07T20:23:36.1503965Z                                            Total:         1.4 MB
2025-05-07T20:23:36.1504288Z The following packages will be UPDATED:
2025-05-07T20:23:36.1508348Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:36.1509311Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:36.1509940Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:36.1510604Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:36.1511425Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:36.1512084Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:36.4996600Z   (progress bars elided: conda-25.3.1, certifi-2025.4.26, ca-certificates-2025.4.26, and conda-libmamba-solver-25.4.0 all reached 100%)
2025-05-07T20:23:36.6000173Z Preparing transaction: done
2025-05-07T20:23:36.7002711Z Verifying transaction: done
2025-05-07T20:23:38.0021936Z Executing transaction: done
2025-05-07T20:23:39.7173193Z [SETUP] Updating Miniconda base packages ...
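The [EXEC] [ATTEMPT 0/3] prefix on the network-bound commands in this log comes from a retry helper defined in .github/scripts/setup_env.bash. Its implementation is not shown here, so the sketch below is a hypothetical reconstruction (name, attempt count, and backoff are assumptions):

    # Hypothetical retry wrapper; the real helper lives in setup_env.bash.
    exec_with_retries () {
      local max=3 attempt
      for attempt in $(seq 0 "$max"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      return 1
    }

    exec_with_retries conda update -n base -c defaults --update-deps -y conda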
2025-05-07T20:23:39.7198512Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:40.6518261Z Channels:
2025-05-07T20:23:40.6518510Z  - defaults
2025-05-07T20:23:40.6518730Z Platform: linux-64
2025-05-07T20:23:41.8739599Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:41.9937039Z Solving environment: done
2025-05-07T20:23:41.9937486Z Channels:
2025-05-07T20:23:41.9937718Z  - defaults
2025-05-07T20:23:41.9937718Z Platform: linux-64
2025-05-07T20:23:42.2834572Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:42.4939455Z Solving environment: done
2025-05-07T20:23:42.6470954Z ## Package Plan ##
2025-05-07T20:23:42.6471293Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:42.6471639Z   added / updated specs:
2025-05-07T20:23:42.6471899Z     - conda
2025-05-07T20:23:42.6472160Z The following packages will be downloaded:
2025-05-07T20:23:42.6472507Z     package                    |            build
2025-05-07T20:23:42.6472852Z     ---------------------------|-----------------
2025-05-07T20:23:42.6473212Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:42.6473604Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:42.6473982Z     ------------------------------------------------------------
2025-05-07T20:23:42.6474591Z                                            Total:         1.4 MB
2025-05-07T20:23:42.6474936Z The following packages will be UPDATED:
2025-05-07T20:23:42.6475464Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:42.6476005Z   tzdata                              2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:42.6476442Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:42.9116838Z   (progress bars elided: pip-25.1 and tzdata-2025b both reached 100%)
2025-05-07T20:23:43.0121223Z Preparing transaction: done
2025-05-07T20:23:43.1127660Z Verifying transaction: done
2025-05-07T20:23:45.2154196Z Executing transaction: done
2025-05-07T20:23:45.8382925Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:45.8386833Z + conda clean --packages --tarball -y
2025-05-07T20:23:46.8403730Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:46.8404078Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:46.9088187Z + conda clean --all -y
2025-05-07T20:23:47.4468569Z There are no unused tarball(s) to remove.
2025-05-07T20:23:47.4469262Z Will remove 1 index cache(s).
2025-05-07T20:23:47.4469821Z There are no unused package(s) to remove.
2025-05-07T20:23:47.4470431Z There are no tempfile(s) to remove. 2025-05-07T20:23:47.4470999Z There are no logfile(s) to remove. 2025-05-07T20:23:47.5091432Z 2025-05-07T20:23:47.5096701Z + conda info 2025-05-07T20:23:47.5096961Z 2025-05-07T20:23:48.2607221Z 2025-05-07T20:23:48.2607853Z active environment : base 2025-05-07T20:23:48.2608202Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:48.2608528Z shell level : 1 2025-05-07T20:23:48.2608801Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:48.2609213Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:48.2609588Z conda version : 25.3.1 2025-05-07T20:23:48.2609861Z conda-build version : not installed 2025-05-07T20:23:48.2610153Z python version : 3.13.2.final.0 2025-05-07T20:23:48.2610449Z solver : libmamba (default) 2025-05-07T20:23:48.2610751Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:48.2611047Z __conda=25.3.1=0 2025-05-07T20:23:48.2611317Z __cuda=12.8=0 2025-05-07T20:23:48.2611588Z __glibc=2.34=0 2025-05-07T20:23:48.2611868Z __linux=6.1.130=0 2025-05-07T20:23:48.2612195Z __unix=0=0 2025-05-07T20:23:48.2612528Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:48.2612939Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:48.2613287Z conda av metadata url : None 2025-05-07T20:23:48.2613987Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:48.2614424Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:48.2614808Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:48.2615192Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:48.2615567Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:48.2615911Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:48.2616250Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:48.2616594Z /home/ec2-user/.conda/envs 2025-05-07T20:23:48.2616901Z platform : linux-64 2025-05-07T20:23:48.2617743Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:48.2618741Z UID:GID : 1000:1000 2025-05-07T20:23:48.2619027Z netrc file : None 2025-05-07T20:23:48.2619291Z offline mode : False 2025-05-07T20:23:48.2619460Z 2025-05-07T20:23:48.3265584Z 2025-05-07T20:23:48.3266094Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:48.3266872Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_253e7459-3102-448d-a886-17ea95ebc735 ... 2025-05-07T20:23:48.3267676Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:48.3354036Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:23:48.3354526Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:23:48.3373611Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:48.3373962Z env: 2025-05-07T20:23:48.3374188Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:48.3374509Z BUILD_ENV: build_binary 2025-05-07T20:23:48.3374758Z BUILD_TARGET: genai 2025-05-07T20:23:48.3374992Z BUILD_VARIANT: cuda 2025-05-07T20:23:48.3375223Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:48.3375479Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:48.3375781Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:48.3376111Z ##[endgroup] 2025-05-07T20:23:48.6746163Z ################################################################################ 2025-05-07T20:23:48.6746553Z # Create Conda Environment 2025-05-07T20:23:48.6746798Z # 2025-05-07T20:23:48.6761393Z # [2025-05-07T20:23:48.675Z] + create_conda_environment build_binary 3.12 2025-05-07T20:23:48.6761815Z ################################################################################ 2025-05-07T20:23:48.6762033Z 2025-05-07T20:23:48.6778051Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:48.7649440Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:48.7649848Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:48.7650198Z + conda info --envs 2025-05-07T20:23:48.7650351Z 2025-05-07T20:23:49.5163162Z 2025-05-07T20:23:49.5163813Z # conda environments: 2025-05-07T20:23:49.5164092Z # 2025-05-07T20:23:49.5164310Z base /home/ec2-user/miniconda 2025-05-07T20:23:49.5164538Z 2025-05-07T20:23:49.5821643Z 2025-05-07T20:23:49.5822436Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:51.2092965Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:51.2093270Z 2025-05-07T20:23:51.2105787Z 2025-05-07T20:23:51.2116079Z [SETUP] Creating new Conda environment (Python 3.12) ... 
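Once the create step below finishes, a quick way to confirm the interpreter actually landed in the new environment is shown here as a sketch (a sanity check, not a command run by this workflow):

    conda env list                               # build_binary should now appear alongside base
    conda run -n build_binary python --version   # expect: Python 3.12.x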
2025-05-07T20:23:51.2139396Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:23:51.9656888Z Channels:
2025-05-07T20:23:51.9657143Z  - defaults
2025-05-07T20:23:51.9657359Z Platform: linux-64
2025-05-07T20:23:53.5321446Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:53.6580788Z Solving environment: done
2025-05-07T20:23:53.6868285Z ## Package Plan ##
2025-05-07T20:23:53.6868731Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:53.6869318Z   added / updated specs:
2025-05-07T20:23:53.6869610Z     - python=3.12
2025-05-07T20:23:53.6869894Z The following packages will be downloaded:
2025-05-07T20:23:53.6870259Z     package                    |            build
2025-05-07T20:23:53.6870579Z     ---------------------------|-----------------
2025-05-07T20:23:53.6870941Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:23:53.6871348Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:23:53.6871875Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:23:53.6872771Z     python-3.12.9              |       h5148396_0        34.7 MB
2025-05-07T20:23:53.6873181Z     setuptools-78.1.1          |  py312h06a4308_0         2.2 MB
2025-05-07T20:23:53.6873569Z     wheel-0.45.1               |  py312h06a4308_0         147 KB
2025-05-07T20:23:53.6873944Z     ------------------------------------------------------------
2025-05-07T20:23:53.6874287Z                                            Total:        37.2 MB
2025-05-07T20:23:53.6874631Z The following NEW packages will be INSTALLED:
2025-05-07T20:23:53.6875279Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:23:53.6875737Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:23:53.6884697Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:23:53.6885363Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:23:53.6885859Z   expat              pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:23:53.6886327Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:23:53.6886790Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:23:53.6887225Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:23:53.6887665Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:23:53.6888130Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:23:53.6888588Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:23:53.6889011Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:23:53.6889433Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:23:53.6889834Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:53.6890240Z   python             pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:23:53.6890678Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:23:53.6891154Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:23:53.6891618Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:23:53.6892132Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:23:53.6892523Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:23:53.6893016Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:23:53.6893542Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:23:53.6893949Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:23:53.6894355Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:54.8021770Z   (progress bars elided: python-3.12.9, setuptools-78.1.1, wheel-0.45.1, ca-certificates-2025.2.25, _openmp_mutex-5.1, and _libgcc_mutex-0.1 all reached 100%)
2025-05-07T20:23:55.0130064Z Preparing transaction: done
2025-05-07T20:23:56.4296371Z Verifying transaction: done
2025-05-07T20:23:58.7442947Z Executing transaction: done
2025-05-07T20:23:58.7951473Z #
2025-05-07T20:23:58.7951736Z # To activate this environment, use
2025-05-07T20:23:58.7952313Z #
2025-05-07T20:23:58.7952527Z #     $ conda activate build_binary
2025-05-07T20:23:58.7952799Z #
2025-05-07T20:23:58.7953011Z # To deactivate an active environment, use
2025-05-07T20:23:58.7953305Z #
2025-05-07T20:23:58.7953497Z #     $ conda deactivate
2025-05-07T20:23:58.9018053Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:23:58.9039846Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:01.8772537Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:01.8773917Z Collecting pip
2025-05-07T20:24:01.8774386Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:01.8775001Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:01.8777786Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 102.2 MB/s eta 0:00:00
2025-05-07T20:24:01.8778162Z Installing collected packages: pip
2025-05-07T20:24:01.8778485Z   Attempting uninstall: pip
2025-05-07T20:24:01.8778771Z     Found existing installation: pip 25.1
2025-05-07T20:24:01.8779088Z     Uninstalling pip-25.1:
2025-05-07T20:24:01.8779370Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:01.8779680Z Successfully installed pip-25.1.1
2025-05-07T20:24:01.9427662Z [SETUP] Upgrading pyOpenSSL ...
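One detail worth noting about the command that follows: the spec pyOpenSSL>22.1.0 contains a shell redirection character, so in a script it has to be quoted; otherwise the shell would parse >22.1.0 as output redirection and create a file named 22.1.0. A minimal sketch, assuming the same environment and channels as the log:

    conda install -n build_binary -c conda-forge --override-channels -y 'pyOpenSSL>22.1.0'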
2025-05-07T20:24:01.9450589Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:02.7967786Z Channels:
2025-05-07T20:24:02.7968050Z  - conda-forge
2025-05-07T20:24:02.7968303Z Platform: linux-64
2025-05-07T20:24:13.2466161Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:14.9612698Z Solving environment: done
2025-05-07T20:24:15.0227702Z ## Package Plan ##
2025-05-07T20:24:15.0228196Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:15.0228640Z   added / updated specs:
2025-05-07T20:24:15.0228917Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:15.0229239Z The following packages will be downloaded:
2025-05-07T20:24:15.0229581Z     package                    |            build
2025-05-07T20:24:15.0229906Z     ---------------------------|-----------------
2025-05-07T20:24:15.0230317Z     cffi-1.17.1                |  py312h06ac9bb_0         288 KB  conda-forge
2025-05-07T20:24:15.0230887Z     cryptography-44.0.3        |  py312hda17c39_0         1.5 MB  conda-forge
2025-05-07T20:24:15.0231511Z     expat-2.7.0                |       h5888daf_0         137 KB  conda-forge
2025-05-07T20:24:15.0232061Z     libexpat-2.7.0             |       h5888daf_0          73 KB  conda-forge
2025-05-07T20:24:15.0232628Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:15.0233071Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:15.0233488Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:15.0233901Z     libnsl-2.0.1               |       hd590300_0          33 KB  conda-forge
2025-05-07T20:24:15.0234319Z     libsqlite-3.46.0           |       hde9e2c9_0         845 KB  conda-forge
2025-05-07T20:24:15.0234743Z     libuuid-2.38.1             |       h0b41bf4_0          33 KB  conda-forge
2025-05-07T20:24:15.0235194Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:15.0235643Z     libzlib-1.2.13             |       h4ab18f5_6          60 KB  conda-forge
2025-05-07T20:24:15.0236066Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:15.0236491Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:15.0237278Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:15.0237724Z     python-3.12.2              |hab00c5b_0_cpython        30.8 MB  conda-forge
2025-05-07T20:24:15.0238154Z     python_abi-3.12            |          7_cp312           7 KB  conda-forge
2025-05-07T20:24:15.0238611Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:15.0239095Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:15.0239723Z     zlib-1.2.13                |       h4ab18f5_6          91 KB  conda-forge
2025-05-07T20:24:15.0240277Z     ------------------------------------------------------------
2025-05-07T20:24:15.0240737Z                                            Total:        38.6 MB
2025-05-07T20:24:15.0241187Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:15.0241613Z   cffi               conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:24:15.0242125Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:24:15.0242637Z   libexpat           conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:24:15.0243080Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:15.0243512Z   libnsl             conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:24:15.0245585Z   libsqlite          conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:24:15.0246131Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:15.0246646Z   libzlib            conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:15.0247160Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:15.0247698Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:15.0248216Z   python_abi         conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:24:15.0248799Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:15.0249490Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:15.0250153Z The following packages will be UPDATED:
2025-05-07T20:24:15.0250987Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:15.0251773Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:15.0252548Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:15.0253186Z   libuuid            pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:24:15.0253817Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:15.0254425Z   zlib               pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:15.0254990Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:15.0255607Z   expat              pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:24:15.0256234Z   python             pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:24:15.0256775Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:15.0257165Z   (progress bars elided: openssl-3.5.0, cryptography-44.0.3, libsqlite-3.46.0, libgcc-15.1.0, libgomp-15.1.0, cffi-1.17.1, expat-2.7.0, pyopenssl-25.0.0, pycparser-2.22, libxcrypt-4.4.36, zlib-1.2.13, typing-extensions-4.13.2, libexpat-2.7.0, libzlib-1.2.13, typing_extensions-4.13.2, libuuid-2.38.1, libgcc-ng-15.1.0, and libnsl-2.0.1 all reached 100%; python-3.12.2 last reported at 81% and still downloading)
2025-05-07T20:24:15.0959496Z   ... (more hidden) ...
2025-05-07T20:24:15.9416884Z 2025-05-07T20:24:15.9416888Z 2025-05-07T20:24:15.9416891Z 2025-05-07T20:24:15.9416895Z 2025-05-07T20:24:15.9416899Z 2025-05-07T20:24:15.9416902Z 2025-05-07T20:24:15.9416906Z 2025-05-07T20:24:15.9416909Z 2025-05-07T20:24:15.9416913Z 2025-05-07T20:24:15.9416917Z 2025-05-07T20:24:15.9416926Z 2025-05-07T20:24:15.9416930Z 2025-05-07T20:24:15.9416933Z 2025-05-07T20:24:15.9416937Z 2025-05-07T20:24:15.9442053Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:24:15.9555925Z python-3.12.2 | 30.8 MB | #########4 | 95% 2025-05-07T20:24:15.9556270Z 2025-05-07T20:24:15.9556277Z 2025-05-07T20:24:15.9556282Z 2025-05-07T20:24:15.9556286Z 2025-05-07T20:24:15.9556309Z 2025-05-07T20:24:15.9556321Z 2025-05-07T20:24:15.9556325Z 2025-05-07T20:24:15.9556328Z 2025-05-07T20:24:15.9556332Z 2025-05-07T20:24:15.9556347Z 2025-05-07T20:24:15.9556351Z 2025-05-07T20:24:15.9556355Z 2025-05-07T20:24:15.9556358Z 2025-05-07T20:24:15.9556362Z 2025-05-07T20:24:15.9556536Z 2025-05-07T20:24:15.9556542Z 2025-05-07T20:24:15.9556545Z 2025-05-07T20:24:15.9556550Z 2025-05-07T20:24:15.9556730Z 2025-05-07T20:24:15.9900650Z ... (more hidden) ... 2025-05-07T20:24:15.9900967Z 2025-05-07T20:24:16.0032425Z openssl-3.5.0 | 3.0 MB | ########## | 100%  2025-05-07T20:24:16.6906565Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:16.6913035Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:16.6913313Z 2025-05-07T20:24:16.6913320Z 2025-05-07T20:24:16.6913324Z 2025-05-07T20:24:16.6913338Z 2025-05-07T20:24:16.6913343Z 2025-05-07T20:24:16.6913348Z 2025-05-07T20:24:16.6913352Z 2025-05-07T20:24:16.6913357Z 2025-05-07T20:24:16.6913361Z 2025-05-07T20:24:16.6913366Z 2025-05-07T20:24:16.6913370Z 2025-05-07T20:24:16.6913386Z 2025-05-07T20:24:16.6913390Z 2025-05-07T20:24:16.6913393Z 2025-05-07T20:24:16.6913397Z 2025-05-07T20:24:16.6913400Z 2025-05-07T20:24:16.6913404Z 2025-05-07T20:24:16.6913407Z 2025-05-07T20:24:16.6913411Z 2025-05-07T20:24:16.6913504Z 2025-05-07T20:24:16.6914003Z  2025-05-07T20:24:16.6914332Z 2025-05-07T20:24:16.6914527Z 2025-05-07T20:24:16.6914698Z  2025-05-07T20:24:16.6914896Z 2025-05-07T20:24:16.6914900Z 2025-05-07T20:24:16.6915071Z  2025-05-07T20:24:16.6915285Z 2025-05-07T20:24:16.6915289Z 2025-05-07T20:24:16.6915316Z 2025-05-07T20:24:16.6915488Z  2025-05-07T20:24:16.6915707Z 2025-05-07T20:24:16.6915712Z 2025-05-07T20:24:16.6915716Z 2025-05-07T20:24:16.6915721Z 2025-05-07T20:24:16.6915926Z  2025-05-07T20:24:16.6916150Z 2025-05-07T20:24:16.6916154Z 2025-05-07T20:24:16.6916158Z 2025-05-07T20:24:16.6916161Z 2025-05-07T20:24:16.6916165Z 2025-05-07T20:24:16.6916338Z  2025-05-07T20:24:16.6916557Z 2025-05-07T20:24:16.6916560Z 2025-05-07T20:24:16.6916564Z 2025-05-07T20:24:16.6916568Z 2025-05-07T20:24:16.6916571Z 2025-05-07T20:24:16.6916575Z 2025-05-07T20:24:16.6916752Z  2025-05-07T20:24:16.6916977Z 2025-05-07T20:24:16.6916981Z 2025-05-07T20:24:16.6916984Z 2025-05-07T20:24:16.6916988Z 2025-05-07T20:24:16.6916991Z 2025-05-07T20:24:16.6916995Z 2025-05-07T20:24:16.6916999Z 2025-05-07T20:24:16.6917175Z  2025-05-07T20:24:16.6917407Z 2025-05-07T20:24:16.6917411Z 2025-05-07T20:24:16.6917414Z 2025-05-07T20:24:16.6917620Z 2025-05-07T20:24:16.6917624Z 2025-05-07T20:24:16.6917628Z 2025-05-07T20:24:16.6917631Z 2025-05-07T20:24:16.6917635Z 2025-05-07T20:24:16.6917819Z  2025-05-07T20:24:16.6918045Z 2025-05-07T20:24:16.6918049Z 2025-05-07T20:24:16.6918052Z 2025-05-07T20:24:16.6918056Z 2025-05-07T20:24:16.6918060Z 2025-05-07T20:24:16.6918070Z 2025-05-07T20:24:16.6918074Z 2025-05-07T20:24:16.6918077Z 
2025-05-07T20:24:16.6918081Z 2025-05-07T20:24:16.6918423Z  2025-05-07T20:24:16.6918641Z 2025-05-07T20:24:16.6918644Z 2025-05-07T20:24:16.6918648Z 2025-05-07T20:24:16.6918657Z 2025-05-07T20:24:16.6918661Z 2025-05-07T20:24:16.6918664Z 2025-05-07T20:24:16.6918668Z 2025-05-07T20:24:16.6918671Z 2025-05-07T20:24:16.6918675Z 2025-05-07T20:24:16.6918678Z 2025-05-07T20:24:16.6918868Z  2025-05-07T20:24:16.6919099Z 2025-05-07T20:24:16.6919103Z 2025-05-07T20:24:16.6919107Z 2025-05-07T20:24:16.6919110Z 2025-05-07T20:24:16.6919114Z 2025-05-07T20:24:16.6919117Z 2025-05-07T20:24:16.6919121Z 2025-05-07T20:24:16.6919125Z 2025-05-07T20:24:16.6919185Z 2025-05-07T20:24:16.6919189Z 2025-05-07T20:24:16.6919192Z 2025-05-07T20:24:16.6919388Z  2025-05-07T20:24:16.6919608Z 2025-05-07T20:24:16.6919612Z 2025-05-07T20:24:16.6919616Z 2025-05-07T20:24:16.6919619Z 2025-05-07T20:24:16.6919623Z 2025-05-07T20:24:16.6919632Z 2025-05-07T20:24:16.6919636Z 2025-05-07T20:24:16.6919639Z 2025-05-07T20:24:16.6919643Z 2025-05-07T20:24:16.6919646Z 2025-05-07T20:24:16.6919650Z 2025-05-07T20:24:16.6919653Z 2025-05-07T20:24:16.6919852Z  2025-05-07T20:24:16.6920076Z 2025-05-07T20:24:16.6920079Z 2025-05-07T20:24:16.6920088Z 2025-05-07T20:24:16.6920092Z 2025-05-07T20:24:16.6920095Z 2025-05-07T20:24:16.6920099Z 2025-05-07T20:24:16.6920108Z 2025-05-07T20:24:16.6920111Z 2025-05-07T20:24:16.6920115Z 2025-05-07T20:24:16.6920119Z 2025-05-07T20:24:16.6920122Z 2025-05-07T20:24:16.6920126Z 2025-05-07T20:24:16.6920129Z 2025-05-07T20:24:16.6920325Z  2025-05-07T20:24:16.6920554Z 2025-05-07T20:24:16.6920557Z 2025-05-07T20:24:16.6920561Z 2025-05-07T20:24:16.6920565Z 2025-05-07T20:24:16.6920568Z 2025-05-07T20:24:16.6920572Z 2025-05-07T20:24:16.6920580Z 2025-05-07T20:24:16.6920584Z 2025-05-07T20:24:16.6920588Z 2025-05-07T20:24:16.6920591Z 2025-05-07T20:24:16.6920595Z 2025-05-07T20:24:16.6920599Z 2025-05-07T20:24:16.6920602Z 2025-05-07T20:24:16.6920606Z 2025-05-07T20:24:16.6920810Z  2025-05-07T20:24:16.6921042Z 2025-05-07T20:24:16.6921050Z 2025-05-07T20:24:16.6921054Z 2025-05-07T20:24:16.6921057Z 2025-05-07T20:24:16.6921061Z 2025-05-07T20:24:16.6921064Z 2025-05-07T20:24:16.6921068Z 2025-05-07T20:24:16.6921071Z 2025-05-07T20:24:16.6921075Z 2025-05-07T20:24:16.6921079Z 2025-05-07T20:24:16.6921082Z 2025-05-07T20:24:16.6921086Z 2025-05-07T20:24:16.6921089Z 2025-05-07T20:24:16.6921093Z 2025-05-07T20:24:16.6921096Z 2025-05-07T20:24:16.6921308Z  2025-05-07T20:24:16.6921536Z 2025-05-07T20:24:16.6921540Z 2025-05-07T20:24:16.6921544Z 2025-05-07T20:24:16.6921551Z 2025-05-07T20:24:16.6921555Z 2025-05-07T20:24:16.6921559Z 2025-05-07T20:24:16.6921563Z 2025-05-07T20:24:16.6921572Z 2025-05-07T20:24:16.6921576Z 2025-05-07T20:24:16.6921579Z 2025-05-07T20:24:16.6921583Z 2025-05-07T20:24:16.6921586Z 2025-05-07T20:24:16.6921590Z 2025-05-07T20:24:16.6921593Z 2025-05-07T20:24:16.6921597Z 2025-05-07T20:24:16.6921600Z 2025-05-07T20:24:16.6921893Z  2025-05-07T20:24:16.6922131Z 2025-05-07T20:24:16.6922135Z 2025-05-07T20:24:16.6922139Z 2025-05-07T20:24:16.6922142Z 2025-05-07T20:24:16.6922146Z 2025-05-07T20:24:16.6922149Z 2025-05-07T20:24:16.6922153Z 2025-05-07T20:24:16.6922156Z 2025-05-07T20:24:16.6922160Z 2025-05-07T20:24:16.6922163Z 2025-05-07T20:24:16.6922167Z 2025-05-07T20:24:16.6922170Z 2025-05-07T20:24:16.6922174Z 2025-05-07T20:24:16.6922177Z 2025-05-07T20:24:16.6922181Z 2025-05-07T20:24:16.6922184Z 2025-05-07T20:24:16.6922274Z 2025-05-07T20:24:16.6922497Z  2025-05-07T20:24:16.6922728Z 2025-05-07T20:24:16.6922732Z 2025-05-07T20:24:16.6922735Z 2025-05-07T20:24:16.6922739Z 
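Conda redraws its per-package download meters with terminal control sequences, which a non-interactive CI capture renders as repeated timestamped fragments. Conda's documented -q/--quiet flag suppresses the meters; a minimal sketch in this job's own install-command style (the package named here is purely illustrative):

# Same invocation pattern used throughout this job, with -q added so the
# CI log records one summary line per package instead of meter redraws.
conda install -n build_binary -c conda-forge --override-channels -y -q libxcrypt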
2025-05-07T20:24:19.4967480Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:21.2406096Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:21.2419541Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:21.2442573Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:22.1089260Z Channels:
2025-05-07T20:24:22.1089517Z  - conda-forge
2025-05-07T20:24:22.1089834Z Platform: linux-64
2025-05-07T20:24:25.4819729Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:25.8526875Z Solving environment: done
2025-05-07T20:24:25.8895574Z # All requested packages already installed.
2025-05-07T20:24:29.2395078Z [SETUP] Copying over ...
2025-05-07T20:24:29.2395929Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h
2025-05-07T20:24:30.8662996Z [SETUP] Installed Python version: Python 3.12.2
2025-05-07T20:24:30.8663462Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:30.8706775Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:30.8707420Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:30.8719215Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:30.8719566Z env:
2025-05-07T20:24:30.8719796Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:30.8720104Z   BUILD_ENV: build_binary
2025-05-07T20:24:30.8720391Z   BUILD_TARGET: genai
2025-05-07T20:24:30.8720626Z   BUILD_VARIANT: cuda
2025-05-07T20:24:30.8720864Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:30.8721118Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:30.8721425Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:30.8721760Z ##[endgroup]
2025-05-07T20:24:31.2085589Z ################################################################################
2025-05-07T20:24:31.2085966Z # Install C/C++ Compilers
2025-05-07T20:24:31.2086209Z #
2025-05-07T20:24:31.2101756Z # [2025-05-07T20:24:31.209Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:31.2102290Z ################################################################################
2025-05-07T20:24:31.2118918Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:31.3002005Z [CHECK] Network does not appear to be blocked.
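The [EXEC] [ATTEMPT 0/3] prefix is emitted by a retry helper sourced from the prelude script (.github/scripts/setup_env.bash); the helper's own source is not part of this log. What follows is only a hedged sketch of the pattern it appears to implement (the function name exec_with_retries, the backoff, and the attempt limit are assumptions, not the script's actual code):

# Hypothetical sketch of an '[EXEC] [ATTEMPT i/N]' retry wrapper; not the
# actual setup_env.bash implementation, which this log does not include.
exec_with_retries () {
  local max_attempts=3
  local i
  for ((i = 0; i <= max_attempts; i++)); do
    echo "[EXEC] [ATTEMPT ${i}/${max_attempts}] + $*"
    "$@" && return 0      # stop on the first successful attempt
    sleep $((2 ** i))     # assumed backoff between attempts
  done
  echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Usage mirroring the network probe above:
#   exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null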
2025-05-07T20:24:31.3014718Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:31.3035696Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:32.1668566Z Channels:
2025-05-07T20:24:32.1668875Z  - conda-forge
2025-05-07T20:24:32.1669100Z Platform: linux-64
2025-05-07T20:24:35.4257018Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:35.7901087Z Solving environment: done
2025-05-07T20:24:35.8540937Z ## Package Plan ##
2025-05-07T20:24:35.8541367Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:35.8541792Z   added / updated specs:
2025-05-07T20:24:35.8542062Z     - sysroot_linux-64=2.17
2025-05-07T20:24:35.8542366Z The following packages will be downloaded:
2025-05-07T20:24:35.8542708Z     package                        |            build
2025-05-07T20:24:35.8543028Z     -------------------------------|-----------------
2025-05-07T20:24:35.8543460Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:35.8544063Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:35.8544475Z     ------------------------------------------------------------
2025-05-07T20:24:35.8544825Z                                                Total:        15.4 MB
2025-05-07T20:24:35.8545169Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:35.8545693Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:35.8546262Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:35.8546737Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:36.0526475Z kernel-headers_linux | 921 KB  | ########## | 100%
2025-05-07T20:24:36.3231605Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:24:36.7291862Z done
2025-05-07T20:24:36.8295396Z Preparing transaction: done
2025-05-07T20:24:37.0309755Z Verifying transaction: done
2025-05-07T20:24:37.2361275Z Executing transaction: done
2025-05-07T20:24:37.3906414Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:37.3906806Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:39.0610808Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
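The three [CHECK] lines above appear to verify that nothing on LD_LIBRARY_PATH or from a stale CONDA_PREFIX will shadow the environment's own libstdc++.so.6, which ships as a symbolic link. An equivalent manual spot-check, offered as a hedged sketch (the environment path is copied from this log; the commands themselves are illustrative, not the job's actual check):

# Spot-check the conda-provided libstdc++; the prefix is taken from the log.
env_prefix=/home/ec2-user/miniconda/envs/build_binary
ls -l "${env_prefix}/lib/libstdc++.so.6"         # expect a symbolic link
readlink -f "${env_prefix}/lib/libstdc++.so.6"   # the versioned .so it resolves to
echo "LD_LIBRARY_PATH = ${LD_LIBRARY_PATH:-<unset>}"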
2025-05-07T20:24:39.0623218Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:39.0646439Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:39.9518700Z Channels:
2025-05-07T20:24:39.9518946Z  - conda-forge
2025-05-07T20:24:39.9519168Z Platform: linux-64
2025-05-07T20:24:43.2247185Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:44.1841890Z Solving environment: done
2025-05-07T20:24:44.2494764Z ## Package Plan ##
2025-05-07T20:24:44.2495178Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:44.2495592Z   added / updated specs:
2025-05-07T20:24:44.2495859Z     - gxx_linux-64=11.4.0
2025-05-07T20:24:44.2496154Z The following packages will be downloaded:
2025-05-07T20:24:44.2496522Z     package                         |            build
2025-05-07T20:24:44.2496968Z     --------------------------------|-----------------
2025-05-07T20:24:44.2497543Z     binutils_impl_linux-64-2.40     |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:24:44.2498215Z     binutils_linux-64-2.40          |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:24:44.2498859Z     gcc_impl_linux-64-11.4.0        |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:24:44.2499436Z     gcc_linux-64-11.4.0             |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:24:44.2500015Z     gxx_impl_linux-64-11.4.0        |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:24:44.2500471Z     gxx_linux-64-11.4.0             |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:24:44.2500908Z     ld_impl_linux-64-2.40           |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:24:44.2501391Z     libgcc-devel_linux-64-11.4.0    |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:24:44.2501876Z     libsanitizer-11.4.0             |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:24:44.2502334Z     libstdcxx-15.1.0                |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:24:44.2503200Z     libstdcxx-devel_linux-64-11.4.0 |     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:24:44.2503862Z     libstdcxx-ng-15.1.0             |       h4852527_2          34 KB  conda-forge
2025-05-07T20:24:44.2504274Z     ------------------------------------------------------------
2025-05-07T20:24:44.2504619Z                                                Total:        91.6 MB
2025-05-07T20:24:44.2504962Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:44.2505459Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:44.2506448Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:44.2507582Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:44.2508126Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:44.2508639Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:44.2509149Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:44.2509683Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:44.2510271Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:44.2510775Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:44.2511323Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:44.2511805Z The following packages will be UPDATED:
2025-05-07T20:24:44.2512342Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:44.2513236Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:44.2513825Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:45.2814402Z libstdcxx-15.1.0     | 3.7 MB  | ########## | 100%
2025-05-07T20:24:45.3545820Z libgcc-devel_linux-6 | 2.3 MB  | ########## | 100%
2025-05-07T20:24:45.3962789Z ld_impl_linux-64-2.4 | 691 KB  | ########## | 100%
2025-05-07T20:24:45.4963439Z libstdcxx-ng-15.1.0  | 34 KB   | ########## | 100%
2025-05-07T20:24:45.5662402Z gcc_linux-64-11.4.0  | 31 KB   | ########## | 100%
2025-05-07T20:24:45.5934199Z gxx_linux-64-11.4.0  | 29 KB   | ########## | 100%
2025-05-07T20:24:45.5964291Z libsanitizer-11.4.0  | 3.5 MB  | ########## | 100%
2025-05-07T20:24:45.6966859Z binutils_linux-64-2. | 28 KB   | ########## | 100%
2025-05-07T20:24:45.7968312Z binutils_impl_linux- | 6.0 MB  | ########## | 100%
2025-05-07T20:24:46.3130933Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:24:46.7440240Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:24:46.7446656Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
2025-05-07T20:24:46.7453278Z done
2025-05-07T20:24:46.8458358Z Preparing transaction: done
2025-05-07T20:24:47.0463660Z Verifying transaction: done
2025-05-07T20:24:47.1473188Z Executing transaction: done
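conda-forge's compiler packages ship their binaries under target-triple names such as x86_64-conda-linux-gnu-cc and x86_64-conda-linux-gnu-c++ rather than the conventional cc/gcc/c++/g++, which is why the step below creates symlinks for the plain names. A quick illustrative way to see the prefixed binaries (this ls | grep is an assumption, not part of the job):

# List the triple-prefixed toolchain binaries installed by gxx_linux-64;
# plain cc/gcc/c++/g++ names only exist once the symlink step below runs.
ls /home/ec2-user/miniconda/envs/build_binary/bin/ | grep -- '-conda-linux-gnu-'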
2025-05-07T20:24:47.3127374Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:51.2195116Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:51.2228317Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:51.2257790Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:51.2289168Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:53.1083347Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:53.1719176Z [CHECK] Binary cc found in PATH
2025-05-07T20:24:55.0551460Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:55.1175145Z [CHECK] Binary gcc found in PATH
2025-05-07T20:24:56.9989003Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.0609132Z [CHECK] Binary c++ found in PATH
2025-05-07T20:24:58.9388232Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.0015608Z [CHECK] Binary g++ found in PATH
2025-05-07T20:24:59.0019527Z [INFO] Printing out all preprocessor defines in the C compiler ...
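The dump that follows lists every macro the C compiler predefines. 'cc -dM -E -' is standard GCC usage: -E stops after preprocessing, -dM prints the macro table instead of preprocessed source, and the trailing '-' reads the (empty) program from stdin. A minimal way to reproduce a sorted excerpt inside the job's environment (the sort | head tail is an illustrative addition):

# Reproduce the macro dump below from inside the build_binary environment.
conda run -n build_binary cc -dM -E - < /dev/null | sort | head -n 20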
2025-05-07T20:24:59.0020146Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:24:59.0020377Z 2025-05-07T20:25:00.8953193Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:00.8953724Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:00.8954631Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:00.8954913Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:00.8955261Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:00.8955616Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:00.8955901Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:00.8956203Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:00.8956467Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:00.8956717Z #define __CHAR_BIT__ 8 2025-05-07T20:25:00.8956951Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:00.8957193Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:00.8957443Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:00.8957712Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:00.8957981Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:00.8958286Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.8958588Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:00.8958870Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:00.8959204Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:00.8959705Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:00.8960107Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:00.8960525Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:00.8960835Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:00.8961116Z #define __GCC_IEC_559 2 2025-05-07T20:25:00.8961354Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:00.8961627Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:00.8961892Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:00.8962164Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:00.8962493Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.8962817Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:00.8963077Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:00.8963355Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:00.8963628Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:00.8963884Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:00.8964144Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:00.8964415Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:00.8964669Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:00.8964920Z #define __INT8_C(c) c 2025-05-07T20:25:00.8965159Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:00.8965459Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.8965776Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:00.8966094Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:00.8966450Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:00.8966721Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:00.8966987Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.8967262Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:00.8967534Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:00.8967928Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:00.8968346Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:00.8968629Z #define __linux 1 2025-05-07T20:25:00.8968888Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:00.8969188Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:00.8969468Z #define __unix 1 2025-05-07T20:25:00.8969687Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:00.8969965Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:00.8970247Z #define __WINT_MIN__ 0U 2025-05-07T20:25:00.8970494Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:00.8970781Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:00.8971045Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:00.8971315Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:00.8971567Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:00.8981291Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:00.8981610Z #define __INT64_C(c) c ## L 2025-05-07T20:25:00.8981893Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:00.8982356Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:00.8982633Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:00.8983004Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:00.8983388Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:00.8983640Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:00.8983902Z #define __DBL_DIG__ 15 2025-05-07T20:25:00.8984140Z #define __FLT32_DIG__ 6 2025-05-07T20:25:00.8984443Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:00.8984800Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:00.8985055Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:00.8985379Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:00.8985731Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:00.8985983Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:00.8986246Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:00.8986627Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:00.8987043Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:00.8987325Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:00.8987677Z #define __unix__ 1 2025-05-07T20:25:00.8987905Z #define __INT_WIDTH__ 32 2025-05-07T20:25:00.8988150Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:00.8988392Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:00.8988651Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:00.8988918Z #define __UINT16_C(c) c 2025-05-07T20:25:00.8989154Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:00.8989414Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:00.8989778Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:00.8990150Z #define __gnu_linux__ 1 2025-05-07T20:25:00.8990391Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:00.8990673Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:00.8990958Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.8991228Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:00.8991493Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:00.8991749Z #define __GNUC__ 11 2025-05-07T20:25:00.8991965Z #define __pie__ 2 2025-05-07T20:25:00.8992184Z #define __MMX__ 1 2025-05-07T20:25:00.8992408Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:00.8992668Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:00.8992946Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:00.8993219Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:00.8993560Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:00.8993965Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.8994288Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:00.8994556Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:00.8994814Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:00.8995114Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:00.8995384Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:00.8995642Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:00.8995930Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:00.8996233Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:00.8996495Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:00.8996780Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:00.8997034Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:00.8997293Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:00.8997564Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:00.8997830Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:00.8998081Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:00.8998401Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:00.8998764Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:00.8999033Z #define __SSE2_MATH__ 1 2025-05-07T20:25:00.8999275Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:00.8999577Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.8999866Z #define __amd64 1 2025-05-07T20:25:00.9000085Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:00.9000354Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:00.9000757Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:00.9001068Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:00.9001326Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:00.9001610Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:00.9001861Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:00.9002127Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:00.9002391Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:00.9002649Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:00.9002918Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:00.9003197Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:00.9003444Z #define __x86_64 1 2025-05-07T20:25:00.9003758Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:00.9004124Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:00.9004592Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:00.9005055Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:00.9005534Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:00.9005920Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:00.9006629Z #define __LP64__ 1 2025-05-07T20:25:00.9006895Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9007246Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:00.9007632Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:00.9007912Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:00.9008184Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:00.9008469Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:00.9008749Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:00.9009023Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:00.9009279Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:00.9009544Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:00.9009810Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:00.9010135Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:00.9010503Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:00.9010783Z #define __FLT_DIG__ 6 2025-05-07T20:25:00.9011008Z #define __NO_INLINE__ 1 2025-05-07T20:25:00.9011257Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:00.9011584Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:00.9011927Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:00.9012279Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:00.9012541Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:00.9012791Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:00.9013051Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:00.9013307Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:00.9013602Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:00.9013881Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:00.9014146Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:00.9014454Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:00.9014777Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:00.9015044Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:00.9015311Z #define __FLT128_DIG__ 33 2025-05-07T20:25:00.9015544Z #define __INT32_C(c) c 2025-05-07T20:25:00.9015789Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:00.9016068Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:00.9016344Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:00.9016625Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:00.9016943Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:00.9017244Z #define unix 1 2025-05-07T20:25:00.9017475Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:00.9017791Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9018099Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:00.9018404Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:00.9018739Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:00.9018997Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:00.9019255Z #define __ELF__ 1 2025-05-07T20:25:00.9019488Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:00.9020034Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:00.9020314Z #define __FLT_RADIX__ 2 2025-05-07T20:25:00.9020576Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:00.9020939Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:00.9021303Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:00.9021560Z #define __SSE_MATH__ 1 2025-05-07T20:25:00.9021788Z #define __k8 1 2025-05-07T20:25:00.9022080Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:00.9022460Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:00.9022757Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:00.9023062Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:00.9023318Z #define __LDBL_DIG__ 18 2025-05-07T20:25:00.9023561Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:00.9023820Z #define __x86_64__ 1 2025-05-07T20:25:00.9024055Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:00.9024356Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:00.9024701Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9025002Z #define __FLT64_DIG__ 15 2025-05-07T20:25:00.9025433Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9025789Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:00.9026101Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9026371Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:00.9026650Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9026948Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:00.9027309Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:00.9027708Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:00.9027998Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:00.9028332Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:00.9028660Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:00.9029010Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:00.9029292Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:00.9029601Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:00.9029888Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:00.9030122Z #define __SEG_FS 1 2025-05-07T20:25:00.9030352Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:00.9030628Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:00.9030904Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9031185Z #define __SEG_GS 1 2025-05-07T20:25:00.9031500Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:00.9031886Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:00.9032155Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:00.9032444Z #define __INT16_TYPE__ short int 2025-05-07T20:25:00.9032727Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:00.9033017Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:00.9033282Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:00.9033548Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:00.9033809Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:00.9034156Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:00.9034552Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9034837Z #define linux 1 2025-05-07T20:25:00.9035067Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9035348Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:00.9035622Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:00.9035870Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:00.9036131Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:00.9036395Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:00.9036736Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:00.9037149Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:00.9037485Z #define __code_model_small__ 1 2025-05-07T20:25:00.9037752Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:00.9038037Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:00.9038283Z #define __k8__ 1 2025-05-07T20:25:00.9038592Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:00.9038883Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:00.9039186Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:00.9039423Z #define __pic__ 2 2025-05-07T20:25:00.9039670Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9039984Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:00.9040277Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9040600Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:00.9040971Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:00.9041333Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:00.9041595Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:00.9041888Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:00.9042203Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:00.9042447Z #define __linux__ 1 2025-05-07T20:25:00.9042675Z #define __INT64_TYPE__ long int 2025-05-07T20:25:00.9042938Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:00.9043199Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:00.9043473Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:00.9043862Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:00.9044158Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9044484Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:00.9044785Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:00.9045055Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:00.9045344Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:00.9045645Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:00.9045981Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:00.9046337Z #define __SSE__ 1 2025-05-07T20:25:00.9046564Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:00.9046906Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:00.9047246Z #define __amd64__ 1 2025-05-07T20:25:00.9047472Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:00.9047732Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:00.9047996Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:00.9048274Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:00.9048543Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:00.9048820Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:00.9049079Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:00.9049354Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:00.9049624Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:00.9049975Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:00.9050449Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:00.9050809Z #define _LP64 1 2025-05-07T20:25:00.9051021Z #define __UINT8_C(c) c 2025-05-07T20:25:00.9051271Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:00.9051541Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:00.9051807Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:00.9052210Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:00.9052524Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:00.9052887Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:00.9053357Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:00.9053732Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9054030Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:00.9054340Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:00.9054709Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:00.9055080Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:00.9055340Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:00.9055685Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:00.9056056Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:00.9056320Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:00.9056566Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:00.9056823Z #define __FXSR__ 1 2025-05-07T20:25:00.9057226Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:00.9057687Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:00.9058111Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:00.9058421Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:00.9058672Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:00.9059011Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:00.9059370Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:00.9059609Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:00.9059849Z #define __PIC__ 2 2025-05-07T20:25:00.9060099Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:00.9060502Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:00.9060890Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:00.9061225Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:00.9061555Z #define __SSE2__ 1 2025-05-07T20:25:00.9061778Z #define __INT32_TYPE__ int 2025-05-07T20:25:00.9062029Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:00.9062455Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:00.9062788Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:00.9063148Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:00.9063422Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:00.9063688Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:00.9063959Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9064239Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:00.9064482Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:00.9064732Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:00.9065024Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9065327Z #define __PIE__ 2 2025-05-07T20:25:00.9065646Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:00.9066043Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:00.9066397Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:00.9066757Z #define __INT16_C(c) c 2025-05-07T20:25:00.9066990Z #define __STDC__ 1 2025-05-07T20:25:00.9067223Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:00.9067493Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:00.9067748Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:00.9068047Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:00.9068390Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:00.9068723Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:00.9068988Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:00.9069268Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:00.9069527Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:00.9069809Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:00.9070097Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:00.9070364Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:00.9070661Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:00.9071063Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:00.9071440Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:00.9071742Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:00.9072036Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:00.9072280Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:00.9072445Z 2025-05-07T20:25:00.9589588Z 2025-05-07T20:25:00.9590146Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
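[ASIDE] The macro dump above (and the C++ one that follows) can be reproduced outside the workflow with the same compiler flags the script traces. A minimal bash sketch, assuming a conda env named build_binary as in this job; the conda run invocation and the -dM/-E/-x c++ flags are exactly what the traced commands use, while the trailing sort is an addition here purely for readability:

    # Dump all predefined macros of the C++ compiler in the build_binary env.
    # -dM prints the #define directives, -E stops after preprocessing,
    # and "-x c++ -" reads an (empty) C++ source from stdin.
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | sort

    # Same idea filtered to a single macro, e.g. the default C++ standard:
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus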
2025-05-07T20:25:00.9590615Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:00.9590856Z 2025-05-07T20:25:02.8432152Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:02.8432732Z #define __cpp_attributes 200809L 2025-05-07T20:25:02.8433220Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:02.8433694Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:02.8434076Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:02.8434365Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:02.8435043Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:02.8435408Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:02.8435704Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:02.8436011Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:02.8436327Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:02.8436594Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:02.8436841Z #define __CHAR_BIT__ 8 2025-05-07T20:25:02.8437085Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:02.8437336Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:02.8437590Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:02.8437862Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:02.8438139Z #define __cpp_static_assert 201411L 2025-05-07T20:25:02.8438428Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:02.8438723Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8439025Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:02.8439316Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:02.8439647Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:02.8439975Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:02.8440529Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:02.8440941Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:02.8441255Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:02.8441538Z #define __GCC_IEC_559 2 2025-05-07T20:25:02.8441787Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:02.8442057Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:02.8442334Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:02.8442627Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:02.8442916Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:02.8443239Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:02.8443558Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:02.8443889Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8444216Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:02.8444497Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:02.8444769Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:02.8445060Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:02.8445366Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:02.8445638Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:02.8445898Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:02.8446179Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:02.8446517Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:02.8446846Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:02.8447121Z #define __INT8_C(c) c 2025-05-07T20:25:02.8447363Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:02.8447633Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:02.8447959Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8448289Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:02.8448568Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:02.8448854Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:02.8449179Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:02.8449544Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:02.8449824Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:02.8450105Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:02.8450373Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8450647Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:02.8450928Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:02.8460146Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:02.8460569Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:02.8460862Z #define __linux 1 2025-05-07T20:25:02.8461091Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:02.8461368Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:02.8461651Z #define __unix 1 2025-05-07T20:25:02.8461880Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:02.8462170Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:02.8462576Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:02.8462853Z #define __WINT_MIN__ 0U 2025-05-07T20:25:02.8463108Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:02.8463384Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:02.8463662Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:02.8463930Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:02.8464180Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:02.8464462Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:02.8464760Z #define __INT64_C(c) c ## L 2025-05-07T20:25:02.8465017Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:02.8465317Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:02.8465589Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:02.8465887Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:02.8466170Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:02.8466439Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:02.8466802Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:02.8467183Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:02.8467441Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:02.8467812Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:02.8468084Z #define __DBL_DIG__ 15 2025-05-07T20:25:02.8468317Z #define __FLT32_DIG__ 6 2025-05-07T20:25:02.8468627Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:02.8468973Z #define __GXX_WEAK__ 1 2025-05-07T20:25:02.8469212Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:02.8469464Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:02.8469788Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:02.8470139Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:02.8470401Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:02.8470697Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:02.8471016Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:02.8471432Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:02.8471832Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:02.8472107Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:02.8472361Z #define __unix__ 1 2025-05-07T20:25:02.8472578Z #define __INT_WIDTH__ 32 2025-05-07T20:25:02.8472818Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:02.8473052Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:02.8473303Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:02.8473567Z #define __UINT16_C(c) c 2025-05-07T20:25:02.8473793Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:02.8474044Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:02.8474402Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:02.8474755Z #define __gnu_linux__ 1 2025-05-07T20:25:02.8474988Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:02.8475247Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:02.8475514Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:02.8475796Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8476063Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:02.8476325Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:02.8476573Z #define __GNUC__ 11 2025-05-07T20:25:02.8476784Z #define __GXX_RTTI 1 2025-05-07T20:25:02.8477004Z #define __pie__ 2 2025-05-07T20:25:02.8477207Z #define __MMX__ 1 2025-05-07T20:25:02.8477432Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:02.8477695Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:02.8477968Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:02.8478239Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:02.8478485Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:02.8478777Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:02.8479094Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:02.8479474Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:02.8479861Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:02.8480165Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8480482Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:02.8480842Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:02.8481105Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:02.8481418Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:02.8481714Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:02.8481972Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:02.8482233Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:02.8482514Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:02.8482801Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:02.8483070Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:02.8483352Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:02.8483600Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:02.8483863Z #define __cplusplus 201703L 2025-05-07T20:25:02.8484132Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:02.8484409Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:02.8484668Z #define __DEPRECATED 1 2025-05-07T20:25:02.8484922Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:02.8485224Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:02.8485476Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:02.8485796Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:02.8486243Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:02.8486510Z #define __SSE2_MATH__ 1 2025-05-07T20:25:02.8486756Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:02.8487057Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8487346Z #define __amd64 1 2025-05-07T20:25:02.8487571Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:02.8487845Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:02.8488104Z #define __GNUG__ 11 2025-05-07T20:25:02.8488360Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:02.8488673Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:02.8488920Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:02.8489179Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:02.8489452Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:02.8489721Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:02.8490033Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:02.8490327Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:02.8490599Z #define __cpp_hex_float 201603L 2025-05-07T20:25:02.8490861Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:02.8491130Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:02.8491402Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:02.8491664Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:02.8491932Z #define __x86_64 1 2025-05-07T20:25:02.8492275Z #define __cpp_lambdas 200907L 2025-05-07T20:25:02.8492540Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:02.8492917Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:02.8493307Z #define __cpp_template_auto 201606L 2025-05-07T20:25:02.8493661Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:02.8494117Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:02.8494596Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:02.8494988Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:02.8495242Z #define __LP64__ 1 2025-05-07T20:25:02.8495476Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8495830Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:02.8496204Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:02.8496479Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:02.8496761Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:02.8497029Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:02.8497302Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:02.8497563Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:02.8497829Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:02.8498154Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:02.8498519Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:02.8498794Z #define __FLT_DIG__ 6 2025-05-07T20:25:02.8499017Z #define __NO_INLINE__ 1 2025-05-07T20:25:02.8499413Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:02.8499770Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:02.8500121Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:02.8500379Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:02.8500645Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:02.8500895Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:02.8501172Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:02.8501472Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:02.8501719Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:02.8502013Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:02.8502296Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:02.8502563Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:02.8502859Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:02.8503200Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:02.8503489Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:02.8503744Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:02.8504008Z #define __FLT128_DIG__ 33 2025-05-07T20:25:02.8504244Z #define __INT32_C(c) c 2025-05-07T20:25:02.8504563Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:02.8504840Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:02.8505117Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:02.8505388Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:02.8505699Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:02.8506006Z #define unix 1 2025-05-07T20:25:02.8506561Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:02.8506824Z #define __cpp_rtti 199711L 2025-05-07T20:25:02.8507083Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:02.8507394Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8507687Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:02.8507990Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:02.8508313Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:02.8508552Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:02.8508843Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:02.8509118Z #define __ELF__ 1 2025-05-07T20:25:02.8509343Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:02.8509624Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:02.8509899Z #define __FLT_RADIX__ 2 2025-05-07T20:25:02.8510137Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:02.8510490Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:02.8510849Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:02.8511112Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:02.8511383Z #define __k8 1 2025-05-07T20:25:02.8511677Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:02.8512048Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:02.8512331Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:02.8512624Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:02.8512880Z #define __LDBL_DIG__ 18 2025-05-07T20:25:02.8513115Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:02.8513372Z #define __x86_64__ 1 2025-05-07T20:25:02.8513606Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:02.8513901Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:02.8514235Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8514538Z #define __FLT64_DIG__ 15 2025-05-07T20:25:02.8514807Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8515152Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:02.8515469Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8515734Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:02.8516000Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8516293Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:02.8516657Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:02.8517044Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:02.8517332Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:02.8517798Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:02.8518113Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:02.8518435Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:02.8518726Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:02.8518997Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:02.8519298Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:02.8519574Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:02.8519808Z #define __SEG_FS 1 2025-05-07T20:25:02.8520024Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:02.8520295Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:02.8520567Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8520840Z #define __SEG_GS 1 2025-05-07T20:25:02.8521151Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:02.8521529Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:02.8521791Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:02.8522075Z #define __INT16_TYPE__ short int 2025-05-07T20:25:02.8522359Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:02.8522656Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:02.8523104Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:02.8523348Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:02.8523598Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:02.8523939Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:02.8524327Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8524638Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:02.8524960Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:02.8525265Z #define linux 1 2025-05-07T20:25:02.8525489Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8525761Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:02.8526035Z #define __EXCEPTIONS 1 2025-05-07T20:25:02.8526276Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:02.8526575Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:02.8526848Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:02.8527137Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:02.8527492Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:02.8527889Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:02.8528232Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:02.8528562Z #define __code_model_small__ 1 2025-05-07T20:25:02.8528835Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:02.8529139Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:02.8529446Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:02.8529725Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:02.8530018Z #define __k8__ 1 2025-05-07T20:25:02.8530239Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:02.8530523Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:02.8530818Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:02.8531050Z #define __pic__ 2 2025-05-07T20:25:02.8531300Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8531615Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:02.8531880Z #define __cpp_decltype 200707L 2025-05-07T20:25:02.8532239Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8532569Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:02.8532932Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:02.8533293Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:02.8533587Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:02.8533905Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:02.8534194Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:02.8534445Z #define __linux__ 1 2025-05-07T20:25:02.8534670Z #define __INT64_TYPE__ long int 2025-05-07T20:25:02.8534928Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:02.8535190Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:02.8535466Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:02.8535746Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:02.8536066Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:02.8536448Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8536766Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:02.8537034Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:02.8537327Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:02.8537617Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:02.8537951Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:02.8538310Z #define __SSE__ 1 2025-05-07T20:25:02.8538535Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:02.8538868Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:02.8539212Z #define __amd64__ 1 2025-05-07T20:25:02.8539438Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:02.8539685Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:02.8539954Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:02.8540220Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:02.8540487Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:02.8540760Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:02.8541035Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:02.8541380Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:02.8541731Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:02.8542198Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:02.8542551Z #define _LP64 1 2025-05-07T20:25:02.8542766Z #define __UINT8_C(c) c 2025-05-07T20:25:02.8543004Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:02.8543271Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:02.8543536Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:02.8543799Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:02.8544162Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:02.8544628Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:02.8545005Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8545299Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:02.8545612Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:02.8545923Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:02.8546312Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:02.8546683Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:02.8546941Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:02.8547203Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:02.8547545Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:02.8547906Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:02.8548164Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:02.8548412Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:02.8548656Z #define __FXSR__ 1 2025-05-07T20:25:02.8548957Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:02.8549415Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:02.8549819Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:02.8550136Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:02.8550401Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:02.8550709Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:02.8550997Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:02.8551266Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:02.8551630Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:02.8551993Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:02.8552260Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:02.8552506Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:02.8552737Z #define __PIC__ 2 2025-05-07T20:25:02.8552988Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:02.8553387Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:02.8553767Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:02.8554100Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:02.8554536Z #define __cpp_constexpr 201603L 2025-05-07T20:25:02.8554802Z #define __SSE2__ 1 2025-05-07T20:25:02.8555032Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:02.8555322Z #define __INT32_TYPE__ int 2025-05-07T20:25:02.8555569Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:02.8555824Z #define __cpp_exceptions 199711L 2025-05-07T20:25:02.8556096Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:02.8556428Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:02.8556781Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:02.8557052Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:02.8557318Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:02.8557583Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8557856Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:02.8558102Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:02.8558348Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:02.8558638Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:02.8558926Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8559226Z #define __PIE__ 2 2025-05-07T20:25:02.8559545Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:02.8560046Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:02.8560357Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:02.8560696Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:02.8561061Z #define __INT16_C(c) c 2025-05-07T20:25:02.8561285Z #define __STDC__ 1 2025-05-07T20:25:02.8561496Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:02.8561747Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:02.8562014Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:02.8562262Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:02.8562557Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:02.8562902Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:02.8563236Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:02.8563494Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:02.8563784Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:02.8564064Z #define __SSE_MATH__ 1 2025-05-07T20:25:02.8564302Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:02.8564582Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:02.8564887Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:02.8565161Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:02.8565449Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:02.8565719Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:02.8566009Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:02.8566413Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:02.8566791Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:02.8567084Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:02.8567371Z #define _GNU_SOURCE 1 2025-05-07T20:25:02.8567612Z #define __cpp_init_captures 201304L 2025-05-07T20:25:02.8567893Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:02.8568139Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:02.8568301Z 2025-05-07T20:25:02.9055088Z 2025-05-07T20:25:02.9055529Z + conda run -n build_binary c++ --version 2025-05-07T20:25:02.9055759Z 2025-05-07T20:25:04.7865413Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:04.7865855Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:04.7874457Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:04.7875066Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:04.7875457Z
2025-05-07T20:25:04.7875462Z
2025-05-07T20:25:04.8498883Z
2025-05-07T20:25:04.8499395Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:04.8500185Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:04.8500513Z
2025-05-07T20:25:06.8031175Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:06.8033511Z
2025-05-07T20:25:06.8035030Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:06.8036368Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:06.8037007Z
2025-05-07T20:25:08.7536554Z #define __cplusplus 201703L
2025-05-07T20:25:08.7538667Z
2025-05-07T20:25:08.7539584Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:08.7574993Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:08.7575416Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:08.7588427Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:08.7588774Z env:
2025-05-07T20:25:08.7589001Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:08.7589318Z BUILD_ENV: build_binary
2025-05-07T20:25:08.7589569Z BUILD_TARGET: genai
2025-05-07T20:25:08.7589797Z BUILD_VARIANT: cuda
2025-05-07T20:25:08.7590039Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:08.7590300Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:08.7590602Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:08.7590953Z ##[endgroup]
2025-05-07T20:25:09.0968811Z ################################################################################
2025-05-07T20:25:09.0969165Z # Install CUDA
2025-05-07T20:25:09.0969367Z #
2025-05-07T20:25:09.0985783Z # [2025-05-07T20:25:09.098Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:09.0986174Z ################################################################################
2025-05-07T20:25:09.0986386Z
2025-05-07T20:25:09.1002325Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:09.1927702Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:09.1928258Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:09.1933701Z + conda clean --packages --tarball -y
2025-05-07T20:25:09.1933971Z
2025-05-07T20:25:10.0609502Z Will remove 40 (182.7 MB) tarball(s).
2025-05-07T20:25:10.0610076Z Will remove 7 (108.6 MB) package(s).
2025-05-07T20:25:10.1238879Z
2025-05-07T20:25:10.1248507Z + conda clean --all -y
2025-05-07T20:25:10.1248708Z
2025-05-07T20:25:10.7943049Z There are no unused tarball(s) to remove.
2025-05-07T20:25:10.7943467Z Will remove 1 index cache(s).
2025-05-07T20:25:10.7943779Z There are no unused package(s) to remove.
2025-05-07T20:25:10.7944124Z There are no tempfile(s) to remove.
2025-05-07T20:25:10.7944454Z There are no logfile(s) to remove.
2025-05-07T20:25:10.8572425Z
2025-05-07T20:25:10.8586893Z [INSTALL] Installing CUDA 12.8.0 ...
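[ASIDE] The [EXEC] [ATTEMPT n/3] lines that follow come from a retry wrapper in the sourced setup script (.github/scripts/setup_env.bash, the PRELUDE above); its actual name and internals are not shown in this log. A minimal sketch of the pattern, with run_with_retries as a hypothetical stand-in:

    # Hypothetical sketch of a retry wrapper like the one emitting the
    # "[EXEC] [ATTEMPT n/3]" lines; the real helper lives in setup_env.bash.
    run_with_retries () {
      local max_attempts=3
      local attempt
      for attempt in $(seq 0 "${max_attempts}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0
        fi
        sleep $((2 ** attempt))   # simple backoff between attempts
      done
      return 1
    }

    # The CUDA toolkit install as traced below, wrapped in retries:
    run_with_retries conda install --force-reinstall -n build_binary \
      -c conda-forge --override-channels -y cuda=12.8.0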
2025-05-07T20:25:10.8610950Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:25:11.7719407Z Channels: 2025-05-07T20:25:11.7719731Z - conda-forge 2025-05-07T20:25:11.7720027Z Platform: linux-64 2025-05-07T20:25:22.3559659Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:23.4664875Z Solving environment: - \ | / - done 2025-05-07T20:25:23.5430708Z 2025-05-07T20:25:23.5431233Z ## Package Plan ## 2025-05-07T20:25:23.5431427Z 2025-05-07T20:25:23.5431683Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:23.5432014Z 2025-05-07T20:25:23.5432111Z added / updated specs: 2025-05-07T20:25:23.5432362Z - cuda=12.8.0 2025-05-07T20:25:23.5432497Z 2025-05-07T20:25:23.5432513Z 2025-05-07T20:25:23.5432641Z The following packages will be downloaded: 2025-05-07T20:25:23.5432860Z 2025-05-07T20:25:23.5432982Z package | build 2025-05-07T20:25:23.5433419Z ---------------------------|----------------- 2025-05-07T20:25:23.5433928Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:23.5434492Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:23.5434992Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:23.5435559Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:23.5436175Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:23.5436816Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:25:23.5438701Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5439437Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:23.5440088Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:25:23.5440566Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:25:23.5441025Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:23.5441498Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:25:23.5441997Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:25:23.5442500Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:23.5443198Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:25:23.5443720Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:25:23.5444211Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:25:23.5444662Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:25:23.5445126Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:25:23.5445591Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:25:23.5446058Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:23.5446559Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:25:23.5447024Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:25:23.5447469Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5447955Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5448424Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:23.5448868Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:23.5449338Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:25:23.5449825Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:25:23.5450294Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 
MB conda-forge 2025-05-07T20:25:23.5450768Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:23.5451242Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:25:23.5451701Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:25:23.5452283Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:25:23.5452741Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:25:23.5453197Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:25:23.5453643Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:25:23.5454092Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:23.5454557Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:25:23.5455037Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:25:23.5455497Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:25:23.5455945Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:25:23.5456382Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:23.5456841Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:25:23.5457464Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:23.5457941Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:23.5458417Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:25:23.5458884Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:23.5459324Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:23.5459764Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:25:23.5460225Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:23.5460695Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:23.5461248Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:23.5461714Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:23.5462245Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:23.5462770Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:23.5463264Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:23.5463713Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:23.5464182Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:23.5464653Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:23.5465097Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:23.5465504Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:23.5465919Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:25:23.5466318Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:23.5466706Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:23.5467108Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:23.5467514Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:23.5467902Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:23.5468327Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:25:23.5468786Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:25:23.5469238Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:25:23.5469688Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:23.5470147Z libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge 2025-05-07T20:25:23.5470607Z libcufile-dev-1.13.0.11 | h5888daf_0 
35 KB conda-forge 2025-05-07T20:25:23.5471060Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:25:23.5471521Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:25:23.5471984Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:25:23.5472450Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:25:23.5472932Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:25:23.5473410Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:23.5473885Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:23.5474339Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:23.5474792Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:23.5475350Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:23.5475836Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:23.5476254Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:25:23.5476689Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:23.5477118Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:23.5477520Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:23.5477934Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:25:23.5478367Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:25:23.5478882Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:23.5479319Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:25:23.5479789Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:23.5480261Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:25:23.5480730Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:23.5481193Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:25:23.5481650Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:23.5482101Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:25:23.5482514Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:23.5482947Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:23.5483380Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:23.5483818Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:23.5484233Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:23.5484662Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:23.5485108Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:23.5485528Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:23.5485941Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:23.5486341Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:23.5486781Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:25:23.5487235Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:23.5487628Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:23.5488018Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:23.5488458Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:23.5488904Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:23.5489339Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:23.5489784Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:23.5490194Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:23.5490604Z tk-8.6.13 
|noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:23.5491014Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:23.5491417Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:23.5492008Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:23.5492563Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:23.5493030Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:23.5493516Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:23.5493986Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:23.5494450Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:23.5494906Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:23.5495340Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:23.5495888Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:23.5496331Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:23.5496804Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:23.5497299Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:23.5497764Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:23.5498210Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:23.5498666Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:23.5499113Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:23.5499559Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:23.5500024Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:23.5500493Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:23.5500915Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:23.5501301Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:23.5501679Z ------------------------------------------------------------ 2025-05-07T20:25:23.5502026Z Total: 1.88 GB 2025-05-07T20:25:23.5502236Z 2025-05-07T20:25:23.5502378Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:23.5502605Z 2025-05-07T20:25:23.5502809Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:23.5503236Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:23.5503652Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:23.5504121Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:23.5504552Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:25:23.5505038Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:25:23.5505652Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:25:23.5506774Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:25:23.5507336Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:25:23.5507906Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:25:23.5508435Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 2025-05-07T20:25:23.5508969Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:25:23.5509553Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:23.5510180Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:25:23.5510990Z cuda-cudart-stati~ 
conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1
  cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1
  cuda-cuobjdump     conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0
  cuda-cupti         conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0
  cuda-cupti-dev     conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0
  cuda-cuxxfilt      conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0
  cuda-driver-dev    conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1
  cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1
  cuda-gdb           conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0
  cuda-libraries     conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0
  cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0
  cuda-nsight        conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0
  cuda-nvcc          conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0
  cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1
  cuda-nvcc-impl     conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1
  cuda-nvcc-tools    conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1
  cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0
  cuda-nvdisasm      conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0
  cuda-nvml-dev      conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0
  cuda-nvprof        conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0
  cuda-nvprune       conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0
  cuda-nvrtc         conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0
  cuda-nvrtc-dev     conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0
  cuda-nvtx          conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0
  cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1
  cuda-nvvm-impl     conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1
  cuda-nvvm-tools    conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1
  cuda-nvvp          conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0
  cuda-opencl        conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0
  cuda-opencl-dev    conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0
  cuda-profiler-api  conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0
  cuda-runtime       conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0
  cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0
  cuda-toolkit       conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0
  cuda-tools         conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0
  cuda-version       conda-forge/noarch::cuda-version-12.8-h5d125a7_3
  cuda-visual-tools  conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0
  cxx-compiler       conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
  dbus               conda-forge/linux-64::dbus-1.13.6-h5008d03_3
  font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
  font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
  font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
  font-ttf-ubuntu    conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
  fontconfig         conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
  fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
  fonts-conda-forge  conda-forge/noarch::fonts-conda-forge-1-0
  freetype           conda-forge/linux-64::freetype-2.13.3-ha770c72_1
  gcc                conda-forge/linux-64::gcc-11.4.0-h602e360_13
  gds-tools          conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0
  gmp                conda-forge/linux-64::gmp-6.3.0-hac33072_2
  gxx                conda-forge/linux-64::gxx-11.4.0-h602e360_13
  keyutils           conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
  krb5               conda-forge/linux-64::krb5-1.21.3-h659f571_0
  libcap             conda-forge/linux-64::libcap-2.71-h39aace5_0
  libcublas          conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0
  libcublas-dev      conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0
  libcufft           conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0
  libcufft-dev       conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0
  libcufile          conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0
  libcufile-dev      conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0
  libcurand          conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0
  libcurand-dev      conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0
  libcusolver        conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0
  libcusolver-dev    conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0
  libcusparse        conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0
  libcusparse-dev    conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0
  libedit            conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
  libfreetype        conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
  libfreetype6       conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
  libgcrypt-lib      conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
  libglib            conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
  libglvnd           conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2
  libgpg-error       conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
  libiconv           conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
  libnl              conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
  libnpp             conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0
  libnpp-dev         conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0
  libnuma            conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
  libnvfatbin        conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0
  libnvfatbin-dev    conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0
  libnvjitlink       conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0
  libnvjitlink-dev   conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0
  libnvjpeg          conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0
  libnvjpeg-dev      conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0
  libopengl          conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2
  libpng             conda-forge/linux-64::libpng-1.6.47-h943b412_0
  libsystemd0        conda-forge/linux-64::libsystemd0-256.9-h2774228_0
  libudev1           conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
  libxcb             conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
  libxkbcommon       conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
  libxkbfile         conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
  libxml2            conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
  lz4-c              conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
  nsight-compute     conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0
  nspr               conda-forge/linux-64::nspr-4.36-h5888daf_0
  nss                conda-forge/linux-64::nss-3.111-h159eef7_0
  ocl-icd            conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
  opencl-headers     conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
  pcre2              conda-forge/linux-64::pcre2-10.44-hc749103_2
  pthread-stubs      conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
  rdma-core          conda-forge/linux-64::rdma-core-55.0-h5888daf_0
  wayland            conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
  xcb-util           conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
  xcb-util-cursor    conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
  xcb-util-image     conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
  xcb-util-keysyms   conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
  xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
  xcb-util-wm        conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
  xkeyboard-config   conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
  xorg-libice        conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
  xorg-libsm         conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
  xorg-libx11        conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
  xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
  xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
  xorg-libxdamage    conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
  xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
  xorg-libxext       conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
  xorg-libxfixes     conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
  xorg-libxi         conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
  xorg-libxrandr     conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
  xorg-libxrender    conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
  xorg-libxtst       conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
  zstd               conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2

2025-05-07T20:25:23.5587962Z The following packages will be UPDATED:

  libsqlite          3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
  libzlib            1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
  zlib               1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2

2025-05-07T20:25:23.5589728Z The following packages will be SUPERSEDED by a higher-priority channel:

  sqlite             pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
  tk                 pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
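The transaction above is a conda-forge solve for the full CUDA 12.8.0 toolchain plus host compilers (gcc/gxx 11.4.0 via cxx-compiler). The log does not record the command that requested it, so the following is only a minimal sketch of an equivalent request; the environment name fbgemm-build is hypothetical, and the version pins are copied from the transaction:

  # Hedged sketch, not the workflow's actual invocation: request the CUDA 12.8
  # toolchain and host compilers seen in the transaction above from conda-forge.
  # The rest of the listing (libcublas, libcusparse, nsight-compute, the
  # xorg/xcb stack) arrives as dependencies of these top-level specs.
  conda create -n fbgemm-build -y -c conda-forge \
      cuda-toolkit=12.8.0 cuda-nvcc=12.8.61 \
      cxx-compiler=1.5.2 gcc=11.4.0 gxx=11.4.0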
2025-05-07T20:25:23.5591656Z Downloading and Extracting Packages: ...working...
  libcublas-12.8.3.14  | 460.2 MB
  nsight-compute-2025. | 320.6 MB
  libcusparse-12.5.7.5 | 164.9 MB
  libcusolver-11.7.2.5 | 156.9 MB
  libcufft-11.3.3.41   | 147.4 MB
  libnpp-12.3.3.65     | 130.6 MB
  cuda-nsight-12.8.55  | 113.2 MB
  cuda-nvvp-12.8.57    | 112.4 MB
  cuda-nvrtc-12.8.61   | 63.1 MB
  libcurand-10.3.9.55  | 43.6 MB
  gds-tools-1.13.0.11  | 37.9 MB
  libnvjitlink-12.8.61 | 28.7 MB
  cuda-nvcc-tools-12.8 | 24.5 MB
  cuda-nvvm-tools-12.8 | 23.5 MB
  cuda-nvvm-impl-12.8. | 20.8 MB
  cuda-nvcc-dev_linux- | 12.7 MB
  cuda-sanitizer-api-1 | 8.8 MB
  cuda-nvdisasm-12.8.5 | 4.9 MB
  cuda-cupti-dev-12.8. | 4.0 MB
  ... (more hidden) ...
2025-05-07T20:25:23Z - 2025-05-07T20:25:32Z [progress-bar redraws trimmed: libcufft-11.3.3.41, libcusparse-12.5.7.5, and libcusolver-11.7.2.5 reach 100% by 20:25:31; when this excerpt ends at 20:25:32, libcublas-12.8.3.14 is at ~76%, nsight-compute-2025. at ~93%, libnpp-12.3.3.65 at ~63%, cuda-nsight-12.8.55 at ~33%, and cuda-nvvp-12.8.57 at ~24%]
| 320.6 MB | #########4 | 94%  2025-05-07T20:25:32.7770825Z 2025-05-07T20:25:32.7770829Z 2025-05-07T20:25:32.7770833Z 2025-05-07T20:25:32.7770836Z 2025-05-07T20:25:32.7770840Z 2025-05-07T20:25:32.7770843Z 2025-05-07T20:25:32.8260627Z cuda-nsight-12.8.55 | 113.2 MB | ###5 | 35%  2025-05-07T20:25:32.8443901Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:25:32.8444267Z 2025-05-07T20:25:32.8444274Z 2025-05-07T20:25:32.8444279Z 2025-05-07T20:25:32.8444284Z 2025-05-07T20:25:32.8445682Z 2025-05-07T20:25:32.8450016Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 65%  2025-05-07T20:25:32.8450500Z 2025-05-07T20:25:32.8450505Z 2025-05-07T20:25:32.8450510Z 2025-05-07T20:25:32.8450515Z 2025-05-07T20:25:32.8450521Z 2025-05-07T20:25:32.8450525Z 2025-05-07T20:25:32.8457103Z 2025-05-07T20:25:32.8545961Z cuda-nvvp-12.8.57 | 112.4 MB | ##5 | 26%  2025-05-07T20:25:32.8546352Z 2025-05-07T20:25:32.8800352Z nsight-compute-2025. | 320.6 MB | #########4 | 95%  2025-05-07T20:25:32.8800926Z 2025-05-07T20:25:32.8800932Z 2025-05-07T20:25:32.8800938Z 2025-05-07T20:25:32.8800943Z 2025-05-07T20:25:32.8800948Z 2025-05-07T20:25:32.8804516Z 2025-05-07T20:25:32.9267475Z cuda-nsight-12.8.55 | 113.2 MB | ###7 | 37%  2025-05-07T20:25:32.9449673Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:25:32.9449945Z 2025-05-07T20:25:32.9450030Z 2025-05-07T20:25:32.9450034Z 2025-05-07T20:25:32.9450037Z 2025-05-07T20:25:32.9450132Z 2025-05-07T20:25:32.9456641Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 67%  2025-05-07T20:25:32.9457078Z 2025-05-07T20:25:32.9457085Z 2025-05-07T20:25:32.9457090Z 2025-05-07T20:25:32.9457096Z 2025-05-07T20:25:32.9457101Z 2025-05-07T20:25:32.9457106Z 2025-05-07T20:25:32.9457367Z 2025-05-07T20:25:32.9753292Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 28%  2025-05-07T20:25:32.9753681Z 2025-05-07T20:25:32.9913270Z nsight-compute-2025. | 320.6 MB | #########5 | 96%  2025-05-07T20:25:32.9913572Z 2025-05-07T20:25:32.9913576Z 2025-05-07T20:25:32.9913580Z 2025-05-07T20:25:32.9913583Z 2025-05-07T20:25:32.9913587Z 2025-05-07T20:25:32.9919488Z 2025-05-07T20:25:33.0288766Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 40%  2025-05-07T20:25:33.0458177Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:25:33.0458446Z 2025-05-07T20:25:33.0458450Z 2025-05-07T20:25:33.0458463Z 2025-05-07T20:25:33.0458467Z 2025-05-07T20:25:33.0458471Z 2025-05-07T20:25:33.0458783Z 2025-05-07T20:25:33.0464505Z 2025-05-07T20:25:33.0713782Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 30%  2025-05-07T20:25:33.0714129Z 2025-05-07T20:25:33.0714134Z 2025-05-07T20:25:33.0714137Z 2025-05-07T20:25:33.0714165Z 2025-05-07T20:25:33.0714169Z 2025-05-07T20:25:33.0753230Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 69%  2025-05-07T20:25:33.0753586Z 2025-05-07T20:25:33.1031521Z nsight-compute-2025. | 320.6 MB | #########6 | 96%  2025-05-07T20:25:33.1031942Z 2025-05-07T20:25:33.1031949Z 2025-05-07T20:25:33.1031954Z 2025-05-07T20:25:33.1031970Z 2025-05-07T20:25:33.1031975Z 2025-05-07T20:25:33.1031980Z 2025-05-07T20:25:33.1295639Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 42%  2025-05-07T20:25:33.1460458Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 78% 2025-05-07T20:25:33.1460732Z 2025-05-07T20:25:33.1460737Z 2025-05-07T20:25:33.1460740Z 2025-05-07T20:25:33.1460744Z 2025-05-07T20:25:33.1460748Z 2025-05-07T20:25:33.1460779Z 2025-05-07T20:25:33.1462895Z 2025-05-07T20:25:33.1756884Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 32%  2025-05-07T20:25:33.1757185Z 2025-05-07T20:25:33.1774050Z nsight-compute-2025. 
| 320.6 MB | #########7 | 97%  2025-05-07T20:25:33.1774320Z 2025-05-07T20:25:33.1774325Z 2025-05-07T20:25:33.1774329Z 2025-05-07T20:25:33.1774332Z 2025-05-07T20:25:33.1776113Z 2025-05-07T20:25:33.2112498Z libnpp-12.3.3.65 | 130.6 MB | ####### | 71%  2025-05-07T20:25:33.2112866Z 2025-05-07T20:25:33.2112870Z 2025-05-07T20:25:33.2112873Z 2025-05-07T20:25:33.2112877Z 2025-05-07T20:25:33.2112881Z 2025-05-07T20:25:33.2116931Z 2025-05-07T20:25:33.2304570Z cuda-nsight-12.8.55 | 113.2 MB | ####4 | 44%  2025-05-07T20:25:33.2650644Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:25:33.2651044Z 2025-05-07T20:25:33.2651050Z 2025-05-07T20:25:33.2651055Z 2025-05-07T20:25:33.2651060Z 2025-05-07T20:25:33.2651107Z 2025-05-07T20:25:33.2651114Z 2025-05-07T20:25:33.2651119Z 2025-05-07T20:25:33.2762386Z cuda-nvvp-12.8.57 | 112.4 MB | ###3 | 34%  2025-05-07T20:25:33.2764529Z 2025-05-07T20:25:33.2888592Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:25:33.2888873Z 2025-05-07T20:25:33.2888878Z 2025-05-07T20:25:33.2888881Z 2025-05-07T20:25:33.2888885Z 2025-05-07T20:25:33.2891007Z 2025-05-07T20:25:33.3197755Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 73%  2025-05-07T20:25:33.3198107Z 2025-05-07T20:25:33.3198113Z 2025-05-07T20:25:33.3198118Z 2025-05-07T20:25:33.3198122Z 2025-05-07T20:25:33.3198126Z 2025-05-07T20:25:33.3200047Z 2025-05-07T20:25:33.3308860Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 46%  2025-05-07T20:25:33.3651318Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 79% 2025-05-07T20:25:33.3651699Z 2025-05-07T20:25:33.3651706Z 2025-05-07T20:25:33.3651711Z 2025-05-07T20:25:33.3651716Z 2025-05-07T20:25:33.3651752Z 2025-05-07T20:25:33.3651758Z 2025-05-07T20:25:33.3651763Z 2025-05-07T20:25:33.3818812Z cuda-nvvp-12.8.57 | 112.4 MB | ###5 | 36%  2025-05-07T20:25:33.3819126Z 2025-05-07T20:25:33.3977291Z nsight-compute-2025. | 320.6 MB | #########8 | 99%  2025-05-07T20:25:33.3977595Z 2025-05-07T20:25:33.3977599Z 2025-05-07T20:25:33.3977603Z 2025-05-07T20:25:33.3977607Z 2025-05-07T20:25:33.3978319Z 2025-05-07T20:25:33.4289566Z libnpp-12.3.3.65 | 130.6 MB | #######4 | 75%  2025-05-07T20:25:33.4289883Z 2025-05-07T20:25:33.4289889Z 2025-05-07T20:25:33.4289895Z 2025-05-07T20:25:33.4289899Z 2025-05-07T20:25:33.4289902Z 2025-05-07T20:25:33.4291629Z 2025-05-07T20:25:33.4309109Z cuda-nsight-12.8.55 | 113.2 MB | ####8 | 48%  2025-05-07T20:25:33.4652300Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 80% 2025-05-07T20:25:33.4652586Z 2025-05-07T20:25:33.4652596Z 2025-05-07T20:25:33.4652601Z 2025-05-07T20:25:33.4652834Z 2025-05-07T20:25:33.4652838Z 2025-05-07T20:25:33.4652842Z 2025-05-07T20:25:33.4654392Z 2025-05-07T20:25:33.4824512Z cuda-nvvp-12.8.57 | 112.4 MB | ###7 | 38%  2025-05-07T20:25:33.4826106Z 2025-05-07T20:25:33.4979957Z nsight-compute-2025. 
| 320.6 MB | #########9 | 100%  2025-05-07T20:25:33.4980249Z 2025-05-07T20:25:33.4980254Z 2025-05-07T20:25:33.4980257Z 2025-05-07T20:25:33.4980261Z 2025-05-07T20:25:33.4980265Z 2025-05-07T20:25:33.5292957Z libnpp-12.3.3.65 | 130.6 MB | #######6 | 76%  2025-05-07T20:25:33.5293424Z 2025-05-07T20:25:33.5293429Z 2025-05-07T20:25:33.5293432Z 2025-05-07T20:25:33.5293436Z 2025-05-07T20:25:33.5293440Z 2025-05-07T20:25:33.5295505Z 2025-05-07T20:25:33.5313084Z cuda-nsight-12.8.55 | 113.2 MB | ##### | 50%  2025-05-07T20:25:33.5655332Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:25:33.5655740Z 2025-05-07T20:25:33.5655746Z 2025-05-07T20:25:33.5655786Z 2025-05-07T20:25:33.5655791Z 2025-05-07T20:25:33.5655797Z 2025-05-07T20:25:33.5655802Z 2025-05-07T20:25:33.5659166Z 2025-05-07T20:25:33.5983753Z cuda-nvvp-12.8.57 | 112.4 MB | #### | 40%  2025-05-07T20:25:33.5984221Z 2025-05-07T20:25:33.5984229Z 2025-05-07T20:25:33.5984235Z 2025-05-07T20:25:33.5984240Z 2025-05-07T20:25:33.5984245Z 2025-05-07T20:25:33.6296769Z libnpp-12.3.3.65 | 130.6 MB | #######8 | 78%  2025-05-07T20:25:33.6297076Z 2025-05-07T20:25:33.6297082Z 2025-05-07T20:25:33.6297086Z 2025-05-07T20:25:33.6297090Z 2025-05-07T20:25:33.6297094Z 2025-05-07T20:25:33.6298235Z 2025-05-07T20:25:33.6317245Z cuda-nsight-12.8.55 | 113.2 MB | #####2 | 52%  2025-05-07T20:25:33.6655900Z libcublas-12.8.3.14 | 460.2 MB | ######## | 81% 2025-05-07T20:25:33.6656171Z 2025-05-07T20:25:33.6656183Z 2025-05-07T20:25:33.6656187Z 2025-05-07T20:25:33.6656191Z 2025-05-07T20:25:33.6656196Z 2025-05-07T20:25:33.6656199Z 2025-05-07T20:25:33.6656486Z 2025-05-07T20:25:33.7005508Z cuda-nvvp-12.8.57 | 112.4 MB | ####2 | 42%  2025-05-07T20:25:33.7005912Z 2025-05-07T20:25:33.7005918Z 2025-05-07T20:25:33.7005923Z 2025-05-07T20:25:33.7005953Z 2025-05-07T20:25:33.7008401Z 2025-05-07T20:25:33.7320278Z libnpp-12.3.3.65 | 130.6 MB | ######## | 80%  2025-05-07T20:25:33.7321039Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:25:33.7321291Z 2025-05-07T20:25:33.7321295Z 2025-05-07T20:25:33.7321299Z 2025-05-07T20:25:33.7321303Z 2025-05-07T20:25:33.7321307Z 2025-05-07T20:25:33.7321654Z 2025-05-07T20:25:33.7659906Z cuda-nsight-12.8.55 | 113.2 MB | #####4 | 54%  2025-05-07T20:25:33.7660246Z 2025-05-07T20:25:33.7660250Z 2025-05-07T20:25:33.7660254Z 2025-05-07T20:25:33.7660258Z 2025-05-07T20:25:33.7660261Z 2025-05-07T20:25:33.7660266Z 2025-05-07T20:25:33.7660644Z 2025-05-07T20:25:33.8006465Z cuda-nvvp-12.8.57 | 112.4 MB | ####4 | 45%  2025-05-07T20:25:33.8006919Z 2025-05-07T20:25:33.8006926Z 2025-05-07T20:25:33.8006931Z 2025-05-07T20:25:33.8006935Z 2025-05-07T20:25:33.8006947Z 2025-05-07T20:25:33.8324556Z libnpp-12.3.3.65 | 130.6 MB | ########2 | 82%  2025-05-07T20:25:33.8324949Z 2025-05-07T20:25:33.8324955Z 2025-05-07T20:25:33.8324960Z 2025-05-07T20:25:33.8324965Z 2025-05-07T20:25:33.8324970Z 2025-05-07T20:25:33.8324980Z 2025-05-07T20:25:33.8660426Z cuda-nsight-12.8.55 | 113.2 MB | #####7 | 57%  2025-05-07T20:25:33.8660838Z 2025-05-07T20:25:33.8660844Z 2025-05-07T20:25:33.8660849Z 2025-05-07T20:25:33.8660854Z 2025-05-07T20:25:33.8660859Z 2025-05-07T20:25:33.8660864Z 2025-05-07T20:25:33.8663513Z 2025-05-07T20:25:33.9007747Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 47%  2025-05-07T20:25:33.9008150Z 2025-05-07T20:25:33.9008156Z 2025-05-07T20:25:33.9008161Z 2025-05-07T20:25:33.9008166Z 2025-05-07T20:25:33.9010197Z 2025-05-07T20:25:33.9324140Z libnpp-12.3.3.65 | 130.6 MB | ########4 | 84%  2025-05-07T20:25:33.9324533Z 2025-05-07T20:25:33.9324539Z 2025-05-07T20:25:33.9324544Z 
2025-05-07T20:25:33.9324550Z 2025-05-07T20:25:33.9324573Z 2025-05-07T20:25:33.9324587Z 2025-05-07T20:25:33.9523997Z cuda-nsight-12.8.55 | 113.2 MB | #####9 | 60%  2025-05-07T20:25:33.9713349Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:25:33.9713693Z 2025-05-07T20:25:33.9713700Z 2025-05-07T20:25:33.9713705Z 2025-05-07T20:25:33.9713710Z 2025-05-07T20:25:33.9713715Z 2025-05-07T20:25:33.9713720Z 2025-05-07T20:25:33.9715978Z 2025-05-07T20:25:34.0333529Z cuda-nvvp-12.8.57 | 112.4 MB | ####9 | 50%  2025-05-07T20:25:34.0333843Z 2025-05-07T20:25:34.0333847Z 2025-05-07T20:25:34.0333851Z 2025-05-07T20:25:34.0333855Z 2025-05-07T20:25:34.0333859Z 2025-05-07T20:25:34.0339030Z 2025-05-07T20:25:34.0546139Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 62%  2025-05-07T20:25:34.0631495Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:25:34.0631864Z 2025-05-07T20:25:34.0631870Z 2025-05-07T20:25:34.0631875Z 2025-05-07T20:25:34.0631880Z 2025-05-07T20:25:34.0631904Z 2025-05-07T20:25:34.0763623Z libnpp-12.3.3.65 | 130.6 MB | ########5 | 86%  2025-05-07T20:25:34.0764003Z 2025-05-07T20:25:34.0764009Z 2025-05-07T20:25:34.0764014Z 2025-05-07T20:25:34.0764019Z 2025-05-07T20:25:34.0764024Z 2025-05-07T20:25:34.0764029Z 2025-05-07T20:25:34.0765288Z 2025-05-07T20:25:34.1397454Z cuda-nvvp-12.8.57 | 112.4 MB | #####1 | 52%  2025-05-07T20:25:34.1397870Z 2025-05-07T20:25:34.1397876Z 2025-05-07T20:25:34.1397880Z 2025-05-07T20:25:34.1397885Z 2025-05-07T20:25:34.1397890Z 2025-05-07T20:25:34.1398542Z 2025-05-07T20:25:34.1546351Z cuda-nsight-12.8.55 | 113.2 MB | ######4 | 65%  2025-05-07T20:25:34.1635207Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 83% 2025-05-07T20:25:34.1635595Z 2025-05-07T20:25:34.1635601Z 2025-05-07T20:25:34.1635607Z 2025-05-07T20:25:34.1635626Z 2025-05-07T20:25:34.1635635Z 2025-05-07T20:25:34.1763865Z libnpp-12.3.3.65 | 130.6 MB | ########7 | 88%  2025-05-07T20:25:34.1764255Z 2025-05-07T20:25:34.1764261Z 2025-05-07T20:25:34.1764277Z 2025-05-07T20:25:34.1764282Z 2025-05-07T20:25:34.1764287Z 2025-05-07T20:25:34.1764292Z 2025-05-07T20:25:34.1766391Z 2025-05-07T20:25:34.2408906Z cuda-nvvp-12.8.57 | 112.4 MB | #####4 | 54%  2025-05-07T20:25:34.2409321Z 2025-05-07T20:25:34.2409327Z 2025-05-07T20:25:34.2409332Z 2025-05-07T20:25:34.2409337Z 2025-05-07T20:25:34.2409341Z 2025-05-07T20:25:34.2409348Z 2025-05-07T20:25:34.2627092Z cuda-nsight-12.8.55 | 113.2 MB | ######7 | 67%  2025-05-07T20:25:34.2774012Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:25:34.2774371Z 2025-05-07T20:25:34.2774378Z 2025-05-07T20:25:34.2774418Z 2025-05-07T20:25:34.2774424Z 2025-05-07T20:25:34.2774429Z 2025-05-07T20:25:34.2774443Z 2025-05-07T20:25:34.2776170Z 2025-05-07T20:25:34.2882888Z cuda-nvvp-12.8.57 | 112.4 MB | #####6 | 56%  2025-05-07T20:25:34.2883538Z 2025-05-07T20:25:34.2883546Z 2025-05-07T20:25:34.2883561Z 2025-05-07T20:25:34.2883566Z 2025-05-07T20:25:34.2886642Z 2025-05-07T20:25:34.3431394Z libnpp-12.3.3.65 | 130.6 MB | ########9 | 89%  2025-05-07T20:25:34.3431782Z 2025-05-07T20:25:34.3431795Z 2025-05-07T20:25:34.3431799Z 2025-05-07T20:25:34.3431803Z 2025-05-07T20:25:34.3431807Z 2025-05-07T20:25:34.3431841Z 2025-05-07T20:25:34.3635289Z cuda-nsight-12.8.55 | 113.2 MB | ######9 | 70%  2025-05-07T20:25:34.3885191Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:25:34.3885552Z 2025-05-07T20:25:34.3885558Z 2025-05-07T20:25:34.3885564Z 2025-05-07T20:25:34.3885569Z 2025-05-07T20:25:34.3888211Z 2025-05-07T20:25:34.4431793Z libnpp-12.3.3.65 | 130.6 MB | #########1 | 91%  
2025-05-07T20:25:34.4432453Z 2025-05-07T20:25:34.4432461Z 2025-05-07T20:25:34.4432479Z 2025-05-07T20:25:34.4432485Z 2025-05-07T20:25:34.4432489Z 2025-05-07T20:25:34.4436019Z 2025-05-07T20:25:34.4637457Z cuda-nsight-12.8.55 | 113.2 MB | #######2 | 72%  2025-05-07T20:25:34.4889419Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 85% 2025-05-07T20:25:34.4889732Z 2025-05-07T20:25:34.4889737Z 2025-05-07T20:25:34.4889740Z 2025-05-07T20:25:34.4889744Z 2025-05-07T20:25:34.4893122Z 2025-05-07T20:25:34.5011029Z libnpp-12.3.3.65 | 130.6 MB | #########3 | 94%  2025-05-07T20:25:34.5011450Z 2025-05-07T20:25:34.5011456Z 2025-05-07T20:25:34.5011461Z 2025-05-07T20:25:34.5011466Z 2025-05-07T20:25:34.5011472Z 2025-05-07T20:25:34.5011477Z 2025-05-07T20:25:34.5011482Z 2025-05-07T20:25:34.5444914Z cuda-nvvp-12.8.57 | 112.4 MB | #####8 | 59%  2025-05-07T20:25:34.5445233Z 2025-05-07T20:25:34.5445267Z 2025-05-07T20:25:34.5445271Z 2025-05-07T20:25:34.5445274Z 2025-05-07T20:25:34.5445278Z 2025-05-07T20:25:34.5445290Z 2025-05-07T20:25:34.5641489Z cuda-nsight-12.8.55 | 113.2 MB | #######4 | 75%  2025-05-07T20:25:34.5977919Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 85% 2025-05-07T20:25:34.5978315Z 2025-05-07T20:25:34.5978322Z 2025-05-07T20:25:34.5978327Z 2025-05-07T20:25:34.5978332Z 2025-05-07T20:25:34.5980088Z 2025-05-07T20:25:34.6012523Z libnpp-12.3.3.65 | 130.6 MB | #########5 | 95%  2025-05-07T20:25:34.6012810Z 2025-05-07T20:25:34.6012815Z 2025-05-07T20:25:34.6012818Z 2025-05-07T20:25:34.6012822Z 2025-05-07T20:25:34.6012825Z 2025-05-07T20:25:34.6012829Z 2025-05-07T20:25:34.6012833Z 2025-05-07T20:25:34.6642108Z cuda-nvvp-12.8.57 | 112.4 MB | ######1 | 61%  2025-05-07T20:25:34.6705351Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 86% 2025-05-07T20:25:34.6705654Z 2025-05-07T20:25:34.6705698Z 2025-05-07T20:25:34.6705704Z 2025-05-07T20:25:34.6705711Z 2025-05-07T20:25:34.6705717Z 2025-05-07T20:25:34.6708079Z 2025-05-07T20:25:34.6978807Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 77%  2025-05-07T20:25:34.6979134Z 2025-05-07T20:25:34.6979138Z 2025-05-07T20:25:34.6979142Z 2025-05-07T20:25:34.6979146Z 2025-05-07T20:25:34.6981882Z 2025-05-07T20:25:34.7020548Z libnpp-12.3.3.65 | 130.6 MB | #########7 | 97%  2025-05-07T20:25:34.7020896Z 2025-05-07T20:25:34.7020903Z 2025-05-07T20:25:34.7020908Z 2025-05-07T20:25:34.7020913Z 2025-05-07T20:25:34.7020918Z 2025-05-07T20:25:34.7020923Z 2025-05-07T20:25:34.7020928Z 2025-05-07T20:25:34.7692126Z cuda-nvvp-12.8.57 | 112.4 MB | ######3 | 63%  2025-05-07T20:25:34.7710773Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 2025-05-07T20:25:34.7711145Z 2025-05-07T20:25:34.7711152Z 2025-05-07T20:25:34.7711157Z 2025-05-07T20:25:34.7711162Z 2025-05-07T20:25:34.7711201Z 2025-05-07T20:25:34.7711206Z 2025-05-07T20:25:34.7978949Z cuda-nsight-12.8.55 | 113.2 MB | #######9 | 80%  2025-05-07T20:25:34.7979250Z 2025-05-07T20:25:34.7979254Z 2025-05-07T20:25:34.7979257Z 2025-05-07T20:25:34.7979507Z 2025-05-07T20:25:34.7980599Z 2025-05-07T20:25:34.8030779Z libnpp-12.3.3.65 | 130.6 MB | #########9 | 99%  2025-05-07T20:25:34.8031064Z 2025-05-07T20:25:34.8031070Z 2025-05-07T20:25:34.8031074Z 2025-05-07T20:25:34.8031077Z 2025-05-07T20:25:34.8031081Z 2025-05-07T20:25:34.8031085Z 2025-05-07T20:25:34.8031088Z 2025-05-07T20:25:34.8694820Z cuda-nvvp-12.8.57 | 112.4 MB | ######5 | 66%  2025-05-07T20:25:34.8713608Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 87% 2025-05-07T20:25:34.8713970Z 2025-05-07T20:25:34.8713975Z 2025-05-07T20:25:34.8713978Z 2025-05-07T20:25:34.8713985Z 2025-05-07T20:25:34.8713990Z 
2025-05-07T20:25:34.8713996Z 2025-05-07T20:25:34.9035076Z cuda-nsight-12.8.55 | 113.2 MB | ########2 | 82%  2025-05-07T20:25:34.9035662Z 2025-05-07T20:25:34.9035668Z 2025-05-07T20:25:34.9035673Z 2025-05-07T20:25:34.9035677Z 2025-05-07T20:25:34.9035680Z 2025-05-07T20:25:34.9035691Z 2025-05-07T20:25:34.9035705Z 2025-05-07T20:25:34.9715054Z cuda-nvvp-12.8.57 | 112.4 MB | ######7 | 68%  2025-05-07T20:25:34.9715497Z 2025-05-07T20:25:34.9715513Z 2025-05-07T20:25:34.9715518Z 2025-05-07T20:25:34.9715524Z 2025-05-07T20:25:34.9715529Z 2025-05-07T20:25:34.9715534Z 2025-05-07T20:25:34.9725267Z cuda-nsight-12.8.55 | 113.2 MB | ########4 | 85%  2025-05-07T20:25:35.0044552Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:25:35.0044824Z 2025-05-07T20:25:35.0045196Z 2025-05-07T20:25:35.0045213Z 2025-05-07T20:25:35.0045219Z 2025-05-07T20:25:35.0045225Z 2025-05-07T20:25:35.0045233Z 2025-05-07T20:25:35.0045241Z 2025-05-07T20:25:35.0717322Z cuda-nvvp-12.8.57 | 112.4 MB | ####### | 70%  2025-05-07T20:25:35.0717741Z 2025-05-07T20:25:35.0717747Z 2025-05-07T20:25:35.0717752Z 2025-05-07T20:25:35.0717757Z 2025-05-07T20:25:35.0717762Z 2025-05-07T20:25:35.0717767Z 2025-05-07T20:25:35.0725816Z cuda-nsight-12.8.55 | 113.2 MB | ########7 | 88%  2025-05-07T20:25:35.1046175Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 88% 2025-05-07T20:25:35.1046581Z 2025-05-07T20:25:35.1046587Z 2025-05-07T20:25:35.1046593Z 2025-05-07T20:25:35.1046612Z 2025-05-07T20:25:35.1046618Z 2025-05-07T20:25:35.1046623Z 2025-05-07T20:25:35.1046628Z 2025-05-07T20:25:35.1717257Z cuda-nvvp-12.8.57 | 112.4 MB | #######2 | 73%  2025-05-07T20:25:35.1717698Z 2025-05-07T20:25:35.1717711Z 2025-05-07T20:25:35.1717715Z 2025-05-07T20:25:35.1717719Z 2025-05-07T20:25:35.1717726Z 2025-05-07T20:25:35.1717730Z 2025-05-07T20:25:35.1739881Z cuda-nsight-12.8.55 | 113.2 MB | ######### | 90%  2025-05-07T20:25:35.2049336Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:25:35.2049758Z 2025-05-07T20:25:35.2049765Z 2025-05-07T20:25:35.2049771Z 2025-05-07T20:25:35.2049777Z 2025-05-07T20:25:35.2049782Z 2025-05-07T20:25:35.2049789Z 2025-05-07T20:25:35.2049809Z 2025-05-07T20:25:35.2720310Z cuda-nvvp-12.8.57 | 112.4 MB | #######5 | 75%  2025-05-07T20:25:35.2720690Z 2025-05-07T20:25:35.2720694Z 2025-05-07T20:25:35.2720698Z 2025-05-07T20:25:35.2720701Z 2025-05-07T20:25:35.2720705Z 2025-05-07T20:25:35.2720709Z 2025-05-07T20:25:35.2809302Z cuda-nsight-12.8.55 | 113.2 MB | #########2 | 93%  2025-05-07T20:25:35.3086989Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:25:35.3087390Z 2025-05-07T20:25:35.3087397Z 2025-05-07T20:25:35.3087403Z 2025-05-07T20:25:35.3087409Z 2025-05-07T20:25:35.3087415Z 2025-05-07T20:25:35.3087421Z 2025-05-07T20:25:35.3088807Z 2025-05-07T20:25:35.3732359Z cuda-nvvp-12.8.57 | 112.4 MB | #######7 | 77%  2025-05-07T20:25:35.3732845Z 2025-05-07T20:25:35.3732852Z 2025-05-07T20:25:35.3732857Z 2025-05-07T20:25:35.3732862Z 2025-05-07T20:25:35.3732867Z 2025-05-07T20:25:35.3732873Z 2025-05-07T20:25:35.3813006Z cuda-nsight-12.8.55 | 113.2 MB | #########5 | 95%  2025-05-07T20:25:35.4093136Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:25:35.4093415Z 2025-05-07T20:25:35.4093419Z 2025-05-07T20:25:35.4093423Z 2025-05-07T20:25:35.4093426Z 2025-05-07T20:25:35.4093430Z 2025-05-07T20:25:35.4093434Z 2025-05-07T20:25:35.4096883Z 2025-05-07T20:25:35.4817981Z cuda-nvvp-12.8.57 | 112.4 MB | #######9 | 80%  2025-05-07T20:25:35.4927853Z libcublas-12.8.3.14 | 460.2 MB | ######### | 91% 2025-05-07T20:25:35.4928117Z 
2025-05-07T20:25:35.4928121Z 2025-05-07T20:25:35.4928125Z 2025-05-07T20:25:35.4928129Z 2025-05-07T20:25:35.4928133Z 2025-05-07T20:25:35.4928137Z 2025-05-07T20:25:35.5205426Z cuda-nsight-12.8.55 | 113.2 MB | #########7 | 98%  2025-05-07T20:25:35.5205992Z 2025-05-07T20:25:35.5205997Z 2025-05-07T20:25:35.5206000Z 2025-05-07T20:25:35.5206004Z 2025-05-07T20:25:35.5206008Z 2025-05-07T20:25:35.5206012Z 2025-05-07T20:25:35.5206453Z 2025-05-07T20:25:35.5871081Z cuda-nvvp-12.8.57 | 112.4 MB | ########2 | 82%  2025-05-07T20:25:35.6205699Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 91% 2025-05-07T20:25:35.6206029Z 2025-05-07T20:25:35.6206035Z 2025-05-07T20:25:35.6206039Z 2025-05-07T20:25:35.6206045Z 2025-05-07T20:25:35.6206050Z 2025-05-07T20:25:35.6206056Z 2025-05-07T20:25:35.6206061Z 2025-05-07T20:25:35.6871831Z cuda-nvvp-12.8.57 | 112.4 MB | ########4 | 84%  2025-05-07T20:25:35.7213852Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:25:35.7214125Z 2025-05-07T20:25:35.7214138Z 2025-05-07T20:25:35.7214142Z 2025-05-07T20:25:35.7214146Z 2025-05-07T20:25:35.7214150Z 2025-05-07T20:25:35.7214153Z 2025-05-07T20:25:35.7214990Z 2025-05-07T20:25:35.7878979Z cuda-nvvp-12.8.57 | 112.4 MB | ########6 | 87%  2025-05-07T20:25:35.8218085Z libcublas-12.8.3.14 | 460.2 MB | #########2 | 93% 2025-05-07T20:25:35.8218353Z 2025-05-07T20:25:35.8218357Z 2025-05-07T20:25:35.8218400Z 2025-05-07T20:25:35.8218406Z 2025-05-07T20:25:35.8218421Z 2025-05-07T20:25:35.8218427Z 2025-05-07T20:25:35.8218431Z 2025-05-07T20:25:35.8879486Z cuda-nvvp-12.8.57 | 112.4 MB | ########9 | 90%  2025-05-07T20:25:35.9221282Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 93% 2025-05-07T20:25:35.9221560Z 2025-05-07T20:25:35.9221564Z 2025-05-07T20:25:35.9221568Z 2025-05-07T20:25:35.9221572Z 2025-05-07T20:25:35.9221575Z 2025-05-07T20:25:35.9221580Z 2025-05-07T20:25:35.9221584Z 2025-05-07T20:25:35.9879800Z cuda-nvvp-12.8.57 | 112.4 MB | #########2 | 93%  2025-05-07T20:25:36.0223893Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 94% 2025-05-07T20:25:36.0224236Z 2025-05-07T20:25:36.0224278Z 2025-05-07T20:25:36.0224284Z 2025-05-07T20:25:36.0224289Z 2025-05-07T20:25:36.0224294Z 2025-05-07T20:25:36.0224314Z 2025-05-07T20:25:36.0224750Z 2025-05-07T20:25:36.1055487Z cuda-nvvp-12.8.57 | 112.4 MB | #########5 | 95%  2025-05-07T20:25:36.1245399Z libcublas-12.8.3.14 | 460.2 MB | #########4 | 94% 2025-05-07T20:25:36.1245734Z 2025-05-07T20:25:36.1245739Z 2025-05-07T20:25:36.1245747Z 2025-05-07T20:25:36.1245752Z 2025-05-07T20:25:36.1245757Z 2025-05-07T20:25:36.1245763Z 2025-05-07T20:25:36.1245779Z 2025-05-07T20:25:36.2101633Z cuda-nvvp-12.8.57 | 112.4 MB | #########8 | 98%  2025-05-07T20:25:36.3102517Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 95% 2025-05-07T20:25:36.4107167Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 96% 2025-05-07T20:25:36.5106883Z libcublas-12.8.3.14 | 460.2 MB | #########6 | 96% 2025-05-07T20:25:36.6589376Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 97% 2025-05-07T20:25:36.7594829Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 98% 2025-05-07T20:25:36.8596627Z libcublas-12.8.3.14 | 460.2 MB | #########8 | 98% 2025-05-07T20:25:38.5124394Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 99% 2025-05-07T20:25:38.5124974Z 2025-05-07T20:25:38.5124995Z 2025-05-07T20:25:38.5125000Z 2025-05-07T20:25:38.5125005Z 2025-05-07T20:25:39.0297125Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:25:39.0297464Z 2025-05-07T20:25:39.0297468Z 2025-05-07T20:25:39.0297472Z 2025-05-07T20:25:39.0297476Z 
2025-05-07T20:25:39.0299762Z 2025-05-07T20:25:39.0769233Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:25:39.0769523Z 2025-05-07T20:25:39.0769527Z 2025-05-07T20:25:39.0769534Z 2025-05-07T20:25:39.0769538Z 2025-05-07T20:25:39.0769542Z 2025-05-07T20:25:39.0769546Z 2025-05-07T20:25:39.0769550Z 2025-05-07T20:25:39.0776407Z 2025-05-07T20:25:39.1773531Z cuda-nvrtc-12.8.61 | 63.1 MB | | 0%  2025-05-07T20:25:39.1774250Z 2025-05-07T20:25:39.1774256Z 2025-05-07T20:25:39.1774262Z 2025-05-07T20:25:39.1774267Z 2025-05-07T20:25:39.1774272Z 2025-05-07T20:25:39.1774288Z 2025-05-07T20:25:39.1774294Z 2025-05-07T20:25:39.1774315Z 2025-05-07T20:25:39.2774070Z cuda-nvrtc-12.8.61 | 63.1 MB | 5 | 6%  2025-05-07T20:25:39.2774378Z 2025-05-07T20:25:39.2774382Z 2025-05-07T20:25:39.2774393Z 2025-05-07T20:25:39.2774397Z 2025-05-07T20:25:39.2774401Z 2025-05-07T20:25:39.2774408Z 2025-05-07T20:25:39.2774413Z 2025-05-07T20:25:39.2774554Z 2025-05-07T20:25:39.3876596Z cuda-nvrtc-12.8.61 | 63.1 MB | #1 | 12%  2025-05-07T20:25:39.3876919Z 2025-05-07T20:25:39.3876923Z 2025-05-07T20:25:39.3876930Z 2025-05-07T20:25:39.3876935Z 2025-05-07T20:25:39.3876940Z 2025-05-07T20:25:39.3876945Z 2025-05-07T20:25:39.3876950Z 2025-05-07T20:25:39.3877252Z 2025-05-07T20:25:39.3913406Z cuda-nvrtc-12.8.61 | 63.1 MB | #7 | 18%  2025-05-07T20:25:39.3913745Z 2025-05-07T20:25:39.3913749Z 2025-05-07T20:25:39.3913753Z 2025-05-07T20:25:39.3913757Z 2025-05-07T20:25:39.3913760Z 2025-05-07T20:25:39.3913764Z 2025-05-07T20:25:39.4422795Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:25:39.4423258Z 2025-05-07T20:25:39.4423265Z 2025-05-07T20:25:39.4423271Z 2025-05-07T20:25:39.4423278Z 2025-05-07T20:25:39.4423284Z 2025-05-07T20:25:39.4423291Z 2025-05-07T20:25:39.4423296Z 2025-05-07T20:25:39.4423304Z 2025-05-07T20:25:39.4423514Z 2025-05-07T20:25:39.4921809Z libcurand-10.3.9.55 | 43.6 MB | | 0%  2025-05-07T20:25:39.4922192Z 2025-05-07T20:25:39.4922198Z 2025-05-07T20:25:39.4922204Z 2025-05-07T20:25:39.4922209Z 2025-05-07T20:25:39.4922215Z 2025-05-07T20:25:39.4922219Z 2025-05-07T20:25:39.4922222Z 2025-05-07T20:25:39.4923673Z 2025-05-07T20:25:39.5423872Z cuda-nvrtc-12.8.61 | 63.1 MB | ##3 | 24%  2025-05-07T20:25:39.5424215Z 2025-05-07T20:25:39.5424219Z 2025-05-07T20:25:39.5424223Z 2025-05-07T20:25:39.5424228Z 2025-05-07T20:25:39.5424231Z 2025-05-07T20:25:39.5424235Z 2025-05-07T20:25:39.5424239Z 2025-05-07T20:25:39.5424254Z 2025-05-07T20:25:39.5424258Z 2025-05-07T20:25:39.6074309Z libcurand-10.3.9.55 | 43.6 MB | 6 | 7%  2025-05-07T20:25:39.6074772Z 2025-05-07T20:25:39.6074778Z 2025-05-07T20:25:39.6074783Z 2025-05-07T20:25:39.6074788Z 2025-05-07T20:25:39.6074793Z 2025-05-07T20:25:39.6074799Z 2025-05-07T20:25:39.6074806Z 2025-05-07T20:25:39.6080422Z 2025-05-07T20:25:39.6436059Z cuda-nvrtc-12.8.61 | 63.1 MB | ##9 | 29%  2025-05-07T20:25:39.6436378Z 2025-05-07T20:25:39.6436382Z 2025-05-07T20:25:39.6436386Z 2025-05-07T20:25:39.6436390Z 2025-05-07T20:25:39.6436393Z 2025-05-07T20:25:39.6436397Z 2025-05-07T20:25:39.6436400Z 2025-05-07T20:25:39.6436404Z 2025-05-07T20:25:39.6436417Z 2025-05-07T20:25:39.7102191Z libcurand-10.3.9.55 | 43.6 MB | #3 | 14%  2025-05-07T20:25:39.7102683Z 2025-05-07T20:25:39.7102690Z 2025-05-07T20:25:39.7102696Z 2025-05-07T20:25:39.7102710Z 2025-05-07T20:25:39.7102716Z 2025-05-07T20:25:39.7103022Z 2025-05-07T20:25:39.7103029Z 2025-05-07T20:25:39.7103293Z 2025-05-07T20:25:39.7443635Z cuda-nvrtc-12.8.61 | 63.1 MB | ###4 | 35%  2025-05-07T20:25:39.7444037Z 2025-05-07T20:25:39.7444042Z 2025-05-07T20:25:39.7444047Z 
2025-05-07T20:25:39.7444053Z 2025-05-07T20:25:39.7444058Z 2025-05-07T20:25:39.7444063Z 2025-05-07T20:25:39.7444068Z 2025-05-07T20:25:39.7444074Z 2025-05-07T20:25:39.7445729Z 2025-05-07T20:25:39.8232538Z libcurand-10.3.9.55 | 43.6 MB | ## | 21%  2025-05-07T20:25:39.8232992Z 2025-05-07T20:25:39.8232996Z 2025-05-07T20:25:39.8233000Z 2025-05-07T20:25:39.8233004Z 2025-05-07T20:25:39.8233007Z 2025-05-07T20:25:39.8233011Z 2025-05-07T20:25:39.8233015Z 2025-05-07T20:25:39.8233918Z 2025-05-07T20:25:39.8455226Z cuda-nvrtc-12.8.61 | 63.1 MB | ###9 | 40%  2025-05-07T20:25:39.8455589Z 2025-05-07T20:25:39.8455593Z 2025-05-07T20:25:39.8455597Z 2025-05-07T20:25:39.8455615Z 2025-05-07T20:25:39.8455619Z 2025-05-07T20:25:39.8455623Z 2025-05-07T20:25:39.8455626Z 2025-05-07T20:25:39.8455630Z 2025-05-07T20:25:39.8458360Z 2025-05-07T20:25:39.9233314Z libcurand-10.3.9.55 | 43.6 MB | ##7 | 28%  2025-05-07T20:25:39.9233657Z 2025-05-07T20:25:39.9233661Z 2025-05-07T20:25:39.9233665Z 2025-05-07T20:25:39.9233668Z 2025-05-07T20:25:39.9233672Z 2025-05-07T20:25:39.9233676Z 2025-05-07T20:25:39.9233680Z 2025-05-07T20:25:39.9234480Z 2025-05-07T20:25:39.9459402Z cuda-nvrtc-12.8.61 | 63.1 MB | ####5 | 45%  2025-05-07T20:25:39.9459722Z 2025-05-07T20:25:39.9459728Z 2025-05-07T20:25:39.9459731Z 2025-05-07T20:25:39.9459735Z 2025-05-07T20:25:39.9459739Z 2025-05-07T20:25:39.9459764Z 2025-05-07T20:25:39.9459768Z 2025-05-07T20:25:39.9459771Z 2025-05-07T20:25:39.9461168Z 2025-05-07T20:25:40.0357013Z libcurand-10.3.9.55 | 43.6 MB | ###5 | 35%  2025-05-07T20:25:40.0357373Z 2025-05-07T20:25:40.0357378Z 2025-05-07T20:25:40.0357382Z 2025-05-07T20:25:40.0357386Z 2025-05-07T20:25:40.0357389Z 2025-05-07T20:25:40.0357393Z 2025-05-07T20:25:40.0357397Z 2025-05-07T20:25:40.0357400Z 2025-05-07T20:25:40.0545753Z cuda-nvrtc-12.8.61 | 63.1 MB | ##### | 50%  2025-05-07T20:25:40.0546061Z 2025-05-07T20:25:40.0546065Z 2025-05-07T20:25:40.0546069Z 2025-05-07T20:25:40.0546073Z 2025-05-07T20:25:40.0546076Z 2025-05-07T20:25:40.0546080Z 2025-05-07T20:25:40.0546083Z 2025-05-07T20:25:40.0546095Z 2025-05-07T20:25:40.0556513Z 2025-05-07T20:25:40.0838550Z libcurand-10.3.9.55 | 43.6 MB | ####2 | 43%  2025-05-07T20:25:40.0838972Z 2025-05-07T20:25:40.0838978Z 2025-05-07T20:25:40.0839008Z 2025-05-07T20:25:40.0839014Z 2025-05-07T20:25:40.0839018Z 2025-05-07T20:25:40.0839022Z 2025-05-07T20:25:40.0839025Z 2025-05-07T20:25:40.1359098Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:25:40.1359449Z 2025-05-07T20:25:40.1359453Z 2025-05-07T20:25:40.1359457Z 2025-05-07T20:25:40.1359461Z 2025-05-07T20:25:40.1359465Z 2025-05-07T20:25:40.1359477Z 2025-05-07T20:25:40.1359481Z 2025-05-07T20:25:40.1359485Z 2025-05-07T20:25:40.1484402Z cuda-nvrtc-12.8.61 | 63.1 MB | #####5 | 56%  2025-05-07T20:25:40.1484707Z 2025-05-07T20:25:40.1484711Z 2025-05-07T20:25:40.1484723Z 2025-05-07T20:25:40.1484727Z 2025-05-07T20:25:40.1484730Z 2025-05-07T20:25:40.1484734Z 2025-05-07T20:25:40.1484737Z 2025-05-07T20:25:40.1484741Z 2025-05-07T20:25:40.1484744Z 2025-05-07T20:25:40.1484748Z 2025-05-07T20:25:40.1576401Z gds-tools-1.13.0.11 | 37.9 MB | | 0%  2025-05-07T20:25:40.1576738Z 2025-05-07T20:25:40.1576742Z 2025-05-07T20:25:40.1576746Z 2025-05-07T20:25:40.1576749Z 2025-05-07T20:25:40.1576753Z 2025-05-07T20:25:40.1576757Z 2025-05-07T20:25:40.1576760Z 2025-05-07T20:25:40.1576764Z 2025-05-07T20:25:40.1579226Z 2025-05-07T20:25:40.2492281Z libcurand-10.3.9.55 | 43.6 MB | ####9 | 49%  2025-05-07T20:25:40.2492620Z 2025-05-07T20:25:40.2492624Z 2025-05-07T20:25:40.2492628Z 
2025-05-07T20:25:40.2492631Z 2025-05-07T20:25:40.2492635Z 2025-05-07T20:25:40.2492639Z 2025-05-07T20:25:40.2492642Z 2025-05-07T20:25:40.2492647Z 2025-05-07T20:25:40.2492650Z 2025-05-07T20:25:40.2492879Z 2025-05-07T20:25:40.2586451Z gds-tools-1.13.0.11 | 37.9 MB | 7 | 8%  2025-05-07T20:25:40.2586851Z 2025-05-07T20:25:40.2586857Z 2025-05-07T20:25:40.2586862Z 2025-05-07T20:25:40.2586867Z 2025-05-07T20:25:40.2586872Z 2025-05-07T20:25:40.2586877Z 2025-05-07T20:25:40.2586885Z 2025-05-07T20:25:40.2588426Z 2025-05-07T20:25:40.2652375Z cuda-nvrtc-12.8.61 | 63.1 MB | ###### | 61%  2025-05-07T20:25:40.2652676Z 2025-05-07T20:25:40.2652680Z 2025-05-07T20:25:40.2652684Z 2025-05-07T20:25:40.2652687Z 2025-05-07T20:25:40.2652698Z 2025-05-07T20:25:40.2652716Z 2025-05-07T20:25:40.2652722Z 2025-05-07T20:25:40.2652728Z 2025-05-07T20:25:40.2656703Z 2025-05-07T20:25:40.3494528Z libcurand-10.3.9.55 | 43.6 MB | #####6 | 56%  2025-05-07T20:25:40.3494870Z 2025-05-07T20:25:40.3494875Z 2025-05-07T20:25:40.3494880Z 2025-05-07T20:25:40.3494885Z 2025-05-07T20:25:40.3494905Z 2025-05-07T20:25:40.3494910Z 2025-05-07T20:25:40.3494915Z 2025-05-07T20:25:40.3494921Z 2025-05-07T20:25:40.3494927Z 2025-05-07T20:25:40.3498145Z 2025-05-07T20:25:40.3646142Z gds-tools-1.13.0.11 | 37.9 MB | #5 | 15%  2025-05-07T20:25:40.3646490Z 2025-05-07T20:25:40.3646495Z 2025-05-07T20:25:40.3646508Z 2025-05-07T20:25:40.3646512Z 2025-05-07T20:25:40.3646545Z 2025-05-07T20:25:40.3646549Z 2025-05-07T20:25:40.3646554Z 2025-05-07T20:25:40.3646563Z 2025-05-07T20:25:40.3654790Z cuda-nvrtc-12.8.61 | 63.1 MB | ######5 | 65%  2025-05-07T20:25:40.3655101Z 2025-05-07T20:25:40.3655122Z 2025-05-07T20:25:40.3655126Z 2025-05-07T20:25:40.3655130Z 2025-05-07T20:25:40.3655134Z 2025-05-07T20:25:40.3655137Z 2025-05-07T20:25:40.3655141Z 2025-05-07T20:25:40.3655145Z 2025-05-07T20:25:40.3657500Z 2025-05-07T20:25:40.4496518Z libcurand-10.3.9.55 | 43.6 MB | ######3 | 64%  2025-05-07T20:25:40.4496923Z 2025-05-07T20:25:40.4496927Z 2025-05-07T20:25:40.4496938Z 2025-05-07T20:25:40.4496942Z 2025-05-07T20:25:40.4496947Z 2025-05-07T20:25:40.4496950Z 2025-05-07T20:25:40.4496954Z 2025-05-07T20:25:40.4496958Z 2025-05-07T20:25:40.4496961Z 2025-05-07T20:25:40.4496966Z 2025-05-07T20:25:40.4711449Z gds-tools-1.13.0.11 | 37.9 MB | ##3 | 23%  2025-05-07T20:25:40.4711827Z 2025-05-07T20:25:40.4711833Z 2025-05-07T20:25:40.4711838Z 2025-05-07T20:25:40.4711843Z 2025-05-07T20:25:40.4711848Z 2025-05-07T20:25:40.4711853Z 2025-05-07T20:25:40.4711858Z 2025-05-07T20:25:40.4711863Z 2025-05-07T20:25:40.4846803Z cuda-nvrtc-12.8.61 | 63.1 MB | ####### | 70%  2025-05-07T20:25:40.4847203Z 2025-05-07T20:25:40.4847210Z 2025-05-07T20:25:40.4847215Z 2025-05-07T20:25:40.4847220Z 2025-05-07T20:25:40.4847225Z 2025-05-07T20:25:40.4847230Z 2025-05-07T20:25:40.4847235Z 2025-05-07T20:25:40.4847241Z 2025-05-07T20:25:40.4847246Z 2025-05-07T20:25:40.5640687Z libcurand-10.3.9.55 | 43.6 MB | ####### | 71%  2025-05-07T20:25:40.5641072Z 2025-05-07T20:25:40.5641078Z 2025-05-07T20:25:40.5641083Z 2025-05-07T20:25:40.5641088Z 2025-05-07T20:25:40.5641093Z 2025-05-07T20:25:40.5641098Z 2025-05-07T20:25:40.5641104Z 2025-05-07T20:25:40.5641110Z 2025-05-07T20:25:40.5641115Z 2025-05-07T20:25:40.5653171Z 2025-05-07T20:25:40.5732124Z gds-tools-1.13.0.11 | 37.9 MB | ### | 31%  2025-05-07T20:25:40.5732479Z 2025-05-07T20:25:40.5732483Z 2025-05-07T20:25:40.5732487Z 2025-05-07T20:25:40.5732491Z 2025-05-07T20:25:40.5732759Z 2025-05-07T20:25:40.5732765Z 2025-05-07T20:25:40.5732776Z 2025-05-07T20:25:40.5732780Z 2025-05-07T20:25:40.5849046Z 
cuda-nvrtc-12.8.61 | 63.1 MB | #######4 | 75%  2025-05-07T20:25:40.5849364Z 2025-05-07T20:25:40.5849371Z 2025-05-07T20:25:40.5849376Z 2025-05-07T20:25:40.5849390Z 2025-05-07T20:25:40.5849396Z 2025-05-07T20:25:40.5849402Z 2025-05-07T20:25:40.5849408Z 2025-05-07T20:25:40.5849413Z 2025-05-07T20:25:40.5849419Z 2025-05-07T20:25:40.6738706Z libcurand-10.3.9.55 | 43.6 MB | #######7 | 78%  2025-05-07T20:25:40.6739040Z 2025-05-07T20:25:40.6739047Z 2025-05-07T20:25:40.6739053Z 2025-05-07T20:25:40.6739058Z 2025-05-07T20:25:40.6739064Z 2025-05-07T20:25:40.6739069Z 2025-05-07T20:25:40.6739425Z 2025-05-07T20:25:40.6739430Z 2025-05-07T20:25:40.6777167Z cuda-nvrtc-12.8.61 | 63.1 MB | #######9 | 79%  2025-05-07T20:25:40.6777626Z 2025-05-07T20:25:40.6777633Z 2025-05-07T20:25:40.6777638Z 2025-05-07T20:25:40.6777666Z 2025-05-07T20:25:40.6777673Z 2025-05-07T20:25:40.6777678Z 2025-05-07T20:25:40.6777684Z 2025-05-07T20:25:40.6777689Z 2025-05-07T20:25:40.6777695Z 2025-05-07T20:25:40.6778728Z 2025-05-07T20:25:40.6856750Z gds-tools-1.13.0.11 | 37.9 MB | ###8 | 38%  2025-05-07T20:25:40.6857106Z 2025-05-07T20:25:40.6857110Z 2025-05-07T20:25:40.6857113Z 2025-05-07T20:25:40.6857117Z 2025-05-07T20:25:40.6857121Z 2025-05-07T20:25:40.6857124Z 2025-05-07T20:25:40.6857128Z 2025-05-07T20:25:40.6857139Z 2025-05-07T20:25:40.6857147Z 2025-05-07T20:25:40.7779487Z libcurand-10.3.9.55 | 43.6 MB | ########4 | 84%  2025-05-07T20:25:40.7779928Z 2025-05-07T20:25:40.7779932Z 2025-05-07T20:25:40.7779946Z 2025-05-07T20:25:40.7779985Z 2025-05-07T20:25:40.7779989Z 2025-05-07T20:25:40.7779992Z 2025-05-07T20:25:40.7779996Z 2025-05-07T20:25:40.7780000Z 2025-05-07T20:25:40.7780003Z 2025-05-07T20:25:40.7781270Z 2025-05-07T20:25:40.7830233Z gds-tools-1.13.0.11 | 37.9 MB | ####5 | 46%  2025-05-07T20:25:40.7830546Z 2025-05-07T20:25:40.7830550Z 2025-05-07T20:25:40.7830554Z 2025-05-07T20:25:40.7830565Z 2025-05-07T20:25:40.7830569Z 2025-05-07T20:25:40.7830572Z 2025-05-07T20:25:40.7830576Z 2025-05-07T20:25:40.7830579Z 2025-05-07T20:25:40.8004478Z cuda-nvrtc-12.8.61 | 63.1 MB | ########4 | 84%  2025-05-07T20:25:40.8004791Z 2025-05-07T20:25:40.8004795Z 2025-05-07T20:25:40.8004799Z 2025-05-07T20:25:40.8004802Z 2025-05-07T20:25:40.8004806Z 2025-05-07T20:25:40.8004810Z 2025-05-07T20:25:40.8004813Z 2025-05-07T20:25:40.8004817Z 2025-05-07T20:25:40.8005040Z 2025-05-07T20:25:40.8780962Z libcurand-10.3.9.55 | 43.6 MB | #########1 | 91%  2025-05-07T20:25:40.8781438Z 2025-05-07T20:25:40.8781442Z 2025-05-07T20:25:40.8781446Z 2025-05-07T20:25:40.8781450Z 2025-05-07T20:25:40.8781453Z 2025-05-07T20:25:40.8781457Z 2025-05-07T20:25:40.8781461Z 2025-05-07T20:25:40.8781477Z 2025-05-07T20:25:40.8781482Z 2025-05-07T20:25:40.8781851Z 2025-05-07T20:25:40.8856594Z gds-tools-1.13.0.11 | 37.9 MB | #####3 | 53%  2025-05-07T20:25:40.8856906Z 2025-05-07T20:25:40.8856910Z 2025-05-07T20:25:40.8856914Z 2025-05-07T20:25:40.8856918Z 2025-05-07T20:25:40.8856921Z 2025-05-07T20:25:40.8856925Z 2025-05-07T20:25:40.8856929Z 2025-05-07T20:25:40.8858133Z 2025-05-07T20:25:40.9037617Z cuda-nvrtc-12.8.61 | 63.1 MB | ########8 | 89%  2025-05-07T20:25:40.9037912Z 2025-05-07T20:25:40.9038186Z 2025-05-07T20:25:40.9038201Z 2025-05-07T20:25:40.9038212Z 2025-05-07T20:25:40.9038221Z 2025-05-07T20:25:40.9038231Z 2025-05-07T20:25:40.9038241Z 2025-05-07T20:25:40.9038286Z 2025-05-07T20:25:40.9040289Z 2025-05-07T20:25:40.9783570Z libcurand-10.3.9.55 | 43.6 MB | #########7 | 98%  2025-05-07T20:25:40.9783914Z 2025-05-07T20:25:40.9783920Z 2025-05-07T20:25:40.9783925Z 2025-05-07T20:25:40.9784232Z 
2025-05-07T20:25:40.9784242Z 2025-05-07T20:25:40.9784247Z 2025-05-07T20:25:40.9784255Z 2025-05-07T20:25:40.9784261Z 2025-05-07T20:25:40.9784267Z 2025-05-07T20:25:40.9789653Z 2025-05-07T20:25:41.0165941Z gds-tools-1.13.0.11 | 37.9 MB | ###### | 61%  2025-05-07T20:25:41.0166259Z 2025-05-07T20:25:41.0166264Z 2025-05-07T20:25:41.0166269Z 2025-05-07T20:25:41.0166274Z 2025-05-07T20:25:41.0166286Z 2025-05-07T20:25:41.0166291Z 2025-05-07T20:25:41.0166296Z 2025-05-07T20:25:41.0166735Z 2025-05-07T20:25:41.0817130Z cuda-nvrtc-12.8.61 | 63.1 MB | #########3 | 93%  2025-05-07T20:25:41.0817537Z 2025-05-07T20:25:41.0817542Z 2025-05-07T20:25:41.0817546Z 2025-05-07T20:25:41.0817551Z 2025-05-07T20:25:41.0817838Z 2025-05-07T20:25:41.0817842Z 2025-05-07T20:25:41.0817846Z 2025-05-07T20:25:41.0817850Z 2025-05-07T20:25:41.0817853Z 2025-05-07T20:25:41.0817857Z 2025-05-07T20:25:41.1167075Z gds-tools-1.13.0.11 | 37.9 MB | ######7 | 68%  2025-05-07T20:25:41.1167505Z 2025-05-07T20:25:41.1167512Z 2025-05-07T20:25:41.1167517Z 2025-05-07T20:25:41.1167522Z 2025-05-07T20:25:41.1167528Z 2025-05-07T20:25:41.1167532Z 2025-05-07T20:25:41.1167537Z 2025-05-07T20:25:41.1169896Z 2025-05-07T20:25:41.1817200Z cuda-nvrtc-12.8.61 | 63.1 MB | #########8 | 98%  2025-05-07T20:25:41.1817537Z 2025-05-07T20:25:41.1817541Z 2025-05-07T20:25:41.1817544Z 2025-05-07T20:25:41.1817548Z 2025-05-07T20:25:41.1817552Z 2025-05-07T20:25:41.1817556Z 2025-05-07T20:25:41.1817559Z 2025-05-07T20:25:41.1817564Z 2025-05-07T20:25:41.1817568Z 2025-05-07T20:25:41.1817571Z 2025-05-07T20:25:41.2822027Z gds-tools-1.13.0.11 | 37.9 MB | #######6 | 76%  2025-05-07T20:25:41.2822393Z 2025-05-07T20:25:41.2822397Z 2025-05-07T20:25:41.2822401Z 2025-05-07T20:25:41.2822405Z 2025-05-07T20:25:41.2822409Z 2025-05-07T20:25:41.2822412Z 2025-05-07T20:25:41.2822416Z 2025-05-07T20:25:41.2822432Z 2025-05-07T20:25:41.2822448Z 2025-05-07T20:25:41.2822451Z 2025-05-07T20:25:41.3826746Z gds-tools-1.13.0.11 | 37.9 MB | ########6 | 86%  2025-05-07T20:25:41.3827080Z 2025-05-07T20:25:41.3827084Z 2025-05-07T20:25:41.3827096Z 2025-05-07T20:25:41.3827100Z 2025-05-07T20:25:41.3827103Z 2025-05-07T20:25:41.3827107Z 2025-05-07T20:25:41.3827111Z 2025-05-07T20:25:41.3827115Z 2025-05-07T20:25:41.3827118Z 2025-05-07T20:25:41.3827122Z 2025-05-07T20:25:42.3986999Z gds-tools-1.13.0.11 | 37.9 MB | #########4 | 95%  2025-05-07T20:25:42.3987340Z 2025-05-07T20:25:42.3987344Z 2025-05-07T20:25:42.3987348Z 2025-05-07T20:25:42.3987351Z 2025-05-07T20:25:42.3987355Z 2025-05-07T20:25:42.3987385Z 2025-05-07T20:25:42.3987389Z 2025-05-07T20:25:42.3987392Z 2025-05-07T20:25:42.3987396Z 2025-05-07T20:25:42.4602747Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:25:42.4603063Z 2025-05-07T20:25:42.4603082Z 2025-05-07T20:25:42.4603087Z 2025-05-07T20:25:42.4603094Z 2025-05-07T20:25:42.4603098Z 2025-05-07T20:25:42.4603102Z 2025-05-07T20:25:42.4603106Z 2025-05-07T20:25:42.4603110Z 2025-05-07T20:25:42.4603113Z 2025-05-07T20:25:42.4603117Z 2025-05-07T20:25:42.4605482Z 2025-05-07T20:25:42.5607422Z libnvjitlink-12.8.61 | 28.7 MB | | 0%  2025-05-07T20:25:42.5607751Z 2025-05-07T20:25:42.5607756Z 2025-05-07T20:25:42.5607759Z 2025-05-07T20:25:42.5607763Z 2025-05-07T20:25:42.5607767Z 2025-05-07T20:25:42.5607771Z 2025-05-07T20:25:42.5607775Z 2025-05-07T20:25:42.5607779Z 2025-05-07T20:25:42.5607791Z 2025-05-07T20:25:42.5607795Z 2025-05-07T20:25:42.5609614Z 2025-05-07T20:25:42.6626518Z libnvjitlink-12.8.61 | 28.7 MB | #2 | 12%  2025-05-07T20:25:42.6626880Z 2025-05-07T20:25:42.6626885Z 2025-05-07T20:25:42.6626888Z 
2025-05-07T20:25:42.6626892Z 2025-05-07T20:25:42.6626896Z 2025-05-07T20:25:42.6627181Z 2025-05-07T20:25:42.6627186Z 2025-05-07T20:25:42.6627190Z 2025-05-07T20:25:42.6627193Z 2025-05-07T20:25:42.6627197Z 2025-05-07T20:25:42.7330486Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:25:42.7330801Z 2025-05-07T20:25:42.7330805Z 2025-05-07T20:25:42.7330808Z 2025-05-07T20:25:42.7330812Z 2025-05-07T20:25:42.7330815Z 2025-05-07T20:25:42.7330819Z 2025-05-07T20:25:42.7330823Z 2025-05-07T20:25:42.7330826Z 2025-05-07T20:25:42.7330830Z 2025-05-07T20:25:42.7330833Z 2025-05-07T20:25:42.7330837Z 2025-05-07T20:25:42.7331833Z 2025-05-07T20:25:42.7751064Z cuda-nvcc-tools-12.8 | 24.5 MB | | 0%  2025-05-07T20:25:42.7751398Z 2025-05-07T20:25:42.7751671Z 2025-05-07T20:25:42.7751675Z 2025-05-07T20:25:42.7751679Z 2025-05-07T20:25:42.7751682Z 2025-05-07T20:25:42.7751686Z 2025-05-07T20:25:42.7751690Z 2025-05-07T20:25:42.7751693Z 2025-05-07T20:25:42.7751697Z 2025-05-07T20:25:42.7751720Z 2025-05-07T20:25:42.7751724Z 2025-05-07T20:25:42.8333529Z libnvjitlink-12.8.61 | 28.7 MB | ##4 | 25%  2025-05-07T20:25:42.8333859Z 2025-05-07T20:25:42.8333863Z 2025-05-07T20:25:42.8333875Z 2025-05-07T20:25:42.8333879Z 2025-05-07T20:25:42.8333883Z 2025-05-07T20:25:42.8333886Z 2025-05-07T20:25:42.8333890Z 2025-05-07T20:25:42.8333893Z 2025-05-07T20:25:42.8333897Z 2025-05-07T20:25:42.8333900Z 2025-05-07T20:25:42.8333904Z 2025-05-07T20:25:42.8333907Z 2025-05-07T20:25:42.8753986Z cuda-nvcc-tools-12.8 | 24.5 MB | #1 | 12%  2025-05-07T20:25:42.8754377Z 2025-05-07T20:25:42.8754383Z 2025-05-07T20:25:42.8754388Z 2025-05-07T20:25:42.8754395Z 2025-05-07T20:25:42.8754400Z 2025-05-07T20:25:42.8754437Z 2025-05-07T20:25:42.8754442Z 2025-05-07T20:25:42.8754448Z 2025-05-07T20:25:42.8754453Z 2025-05-07T20:25:42.8754459Z 2025-05-07T20:25:42.8754464Z 2025-05-07T20:25:42.9336914Z libnvjitlink-12.8.61 | 28.7 MB | ###3 | 33%  2025-05-07T20:25:42.9337246Z 2025-05-07T20:25:42.9337251Z 2025-05-07T20:25:42.9337255Z 2025-05-07T20:25:42.9337258Z 2025-05-07T20:25:42.9337262Z 2025-05-07T20:25:42.9337266Z 2025-05-07T20:25:42.9337270Z 2025-05-07T20:25:42.9337273Z 2025-05-07T20:25:42.9337285Z 2025-05-07T20:25:42.9337289Z 2025-05-07T20:25:42.9337292Z 2025-05-07T20:25:42.9337296Z 2025-05-07T20:25:42.9755535Z cuda-nvcc-tools-12.8 | 24.5 MB | ##3 | 23%  2025-05-07T20:25:42.9755909Z 2025-05-07T20:25:42.9755914Z 2025-05-07T20:25:42.9755918Z 2025-05-07T20:25:42.9755921Z 2025-05-07T20:25:42.9755925Z 2025-05-07T20:25:42.9755929Z 2025-05-07T20:25:42.9755934Z 2025-05-07T20:25:42.9755938Z 2025-05-07T20:25:42.9755962Z 2025-05-07T20:25:42.9755966Z 2025-05-07T20:25:42.9755970Z 2025-05-07T20:25:43.0582247Z libnvjitlink-12.8.61 | 28.7 MB | ####2 | 42%  2025-05-07T20:25:43.0582582Z 2025-05-07T20:25:43.0582610Z 2025-05-07T20:25:43.0582614Z 2025-05-07T20:25:43.0582617Z 2025-05-07T20:25:43.0582621Z 2025-05-07T20:25:43.0582625Z 2025-05-07T20:25:43.0582628Z 2025-05-07T20:25:43.0582632Z 2025-05-07T20:25:43.0582635Z 2025-05-07T20:25:43.0582639Z 2025-05-07T20:25:43.0582642Z 2025-05-07T20:25:43.0588502Z 2025-05-07T20:25:43.0610498Z cuda-nvcc-tools-12.8 | 24.5 MB | ###5 | 35%  2025-05-07T20:25:43.0610895Z 2025-05-07T20:25:43.0610900Z 2025-05-07T20:25:43.0761337Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:25:43.0770679Z 2025-05-07T20:25:43.0770687Z 2025-05-07T20:25:43.0770692Z 2025-05-07T20:25:43.0770698Z 2025-05-07T20:25:43.0770703Z 2025-05-07T20:25:43.0770708Z 2025-05-07T20:25:43.0770732Z 2025-05-07T20:25:43.0770737Z 2025-05-07T20:25:43.0770742Z 
2025-05-07T20:25:43.1694316Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:25:43.4392132Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:25:44.5016723Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:25:44.5999280Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:25:44.6931083Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:25:44.7582958Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:25:45.1548009Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:25:45.4045091Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:25:45.4991115Z cuda-cupti-dev-12.8. | 4.0 MB | ########## | 100%
2025-05-07T20:25:45.5047900Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:25:45.7988314Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:25:47.6749554Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:25:47.9341275Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:25:48.0383598Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:25:48.0384076Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:25:48.1498515Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:25:48.5161447Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:25:55.3984063Z done
2025-05-07T20:25:55.6022354Z Preparing transaction: done
2025-05-07T20:25:59.5087946Z Verifying transaction: done
2025-05-07T20:26:00.1179946Z Executing transaction: done
2025-05-07T20:26:02.2909621Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:02.2910134Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:02.2911202Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:02.2924583Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
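[NOTE] The two symlinks above exist because recent CUDA 12.x conda packages ship only the versioned library libnvToolsExt.so.1, while anything that links with -lnvToolsExt needs the unversioned name to resolve at link time. A minimal sketch of the same workaround, assuming the environment prefix is available as $CONDA_PREFIX (a stand-in for the hard-coded miniconda paths in this log):

  # Recreate the unversioned NVTX library name so `-lnvToolsExt` resolves.
  # $CONDA_PREFIX is an assumption; this job uses absolute env paths instead.
  for libdir in "$CONDA_PREFIX/lib" "$CONDA_PREFIX/targets/x86_64-linux/lib"; do
    if [ -e "$libdir/libnvToolsExt.so.1" ]; then
      ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
    fi
  done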
2025-05-07T20:26:02.2937441Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:02.2942821Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:02.4592840Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:02.4617227Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:02.4995067Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:04.4009299Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:04.4652300Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:04.8861957Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:04.9217820Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
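[NOTE] The ERROR above appears to be expected: `conda run ... printenv LD_LIBRARY_PATH` exits non-zero when the variable is not yet set on the env, after which the script persists a value with `conda env config vars set`, so every later `conda run -n build_binary` activation sees it. A minimal sketch of this check-then-set pattern, assuming the same env name (the stubs path is copied from the log):

  # Persist LD_LIBRARY_PATH on the env itself; printenv failing here only
  # means the variable has not been set yet, not that something is broken.
  STUBS=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
  if ! conda run -n build_binary printenv LD_LIBRARY_PATH >/dev/null 2>&1; then
    conda env config vars set -n build_binary LD_LIBRARY_PATH="$STUBS"
  fi
  conda run -n build_binary printenv LD_LIBRARY_PATH  # now prints the stubs path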
2025-05-07T20:26:05.3601217Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:05.3602150Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:07.8083280Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:09.8286925Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:11.8608313Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:11.8609145Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:13.8861211Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:15.7712773Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:15.8342055Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:19.6905643Z /tmp/tmpfly9oh1w: line 3: clang: command not found
2025-05-07T20:26:19.6908034Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:19.7536241Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:19.7556471Z total 36
2025-05-07T20:26:19.7556768Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:25 .
2025-05-07T20:26:19.7557172Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:19.7557623Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:19.7558141Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:19.7558631Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:19.7559103Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:19.7559691Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:19.7560321Z -rw-r--r--. 2 ec2-user ec2-user  2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:26:19.7561035Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:19.7561770Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:19.7580947Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:21.7237856Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:21.7238437Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:22.1516027Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:24.0404176Z -allow-unsupported-compiler
2025-05-07T20:26:24.1026114Z [INFO] Printing out all preprocessor defines in nvcc ...
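[NOTE] Two host-compiler workarounds happen above: the sed call deletes the line in the conda activation hook that pins nvcc's host compiler via -ccbin, and NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" makes nvcc skip its host-compiler version check. The dump that follows is produced by preprocessing an empty CUDA translation unit: -E stops after preprocessing, --compiler-options -dM forwards the host compiler's -dM flag (emit every macro definition instead of the preprocessed source), and `-x cu -` treats empty stdin as CUDA code. A minimal sketch for inspecting a few of those macros, assuming nvcc is on PATH (the grep pattern is illustrative):

  # Dump all predefined and header-derived macros for an empty CUDA source,
  # then filter for the CUDA/C++ version macros that appear in this log.
  conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null \
    | grep -E '__CUDACC_VER_(MAJOR|MINOR)__|__cplusplus'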
2025-05-07T20:26:24.1026657Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:24.1026991Z 2025-05-07T20:26:26.0524714Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:26.0525582Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:26.0526019Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:26.0526339Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:26.0526750Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:26.0527116Z #define _STL_PAIR_H 1 2025-05-07T20:26:26.0527454Z #define __cpp_attributes 200809L 2025-05-07T20:26:26.0527910Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:26.0528311Z #define __DELETE_THROW throw() 2025-05-07T20:26:26.0528573Z #define _PTRDIFF_T_ 2025-05-07T20:26:26.0528809Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:26.0529096Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:26.0529522Z #define _IO_LEFT 02 2025-05-07T20:26:26.0529803Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:26.0530068Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:26.0530342Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:26.0531114Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:26.0531560Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:26.0531839Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:26.0532226Z #define _IOS_OUTPUT 2 2025-05-07T20:26:26.0542771Z #define __SM_100_RT_HPP__ 2025-05-07T20:26:26.0543273Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:26.0543798Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:26.0544263Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:26.0544668Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:26.0545065Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:26.0546159Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:26.0547294Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:26.0547628Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:26.0547937Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:26.0548276Z #define _T_WCHAR_ 2025-05-07T20:26:26.0548513Z #define stdout stdout 2025-05-07T20:26:26.0548884Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:26.0549281Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:26.0549537Z #define __flexarr [] 2025-05-07T20:26:26.0549778Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:26.0550093Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:26.0550438Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:26.0550692Z #define _MATH_H 1 2025-05-07T20:26:26.0550971Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:26.0551317Z #define __S64_TYPE long int 2025-05-07T20:26:26.0551582Z #define __stub_fchflags 2025-05-07T20:26:26.0551843Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:26.0552141Z #define __SQUAD_TYPE long int 2025-05-07T20:26:26.0552412Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:26.0552716Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:26.0553060Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:26.0553324Z #define NL_NMAX INT_MAX 2025-05-07T20:26:26.0553564Z #define _BITS_TIME_H 1 
2025-05-07T20:26:26.0553857Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:26.0554217Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:26.0554527Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:26.0554881Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:26.0555287Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:26.0555660Z #define __CHAR_BIT__ 8 2025-05-07T20:26:26.0555919Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0556243Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:26.0556547Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:26.0556814Z #define FP_NAN 0 2025-05-07T20:26:26.0557091Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:26.0557515Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:26.0557913Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:26.0558197Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:26.0558466Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:26.0558726Z #define __SM_80_RT_H__ 2025-05-07T20:26:26.0558950Z #define _NEW 2025-05-07T20:26:26.0559186Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:26.0559476Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:26.0559843Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:26.0560257Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:26.0560502Z #define __USE_ANSI 1 2025-05-07T20:26:26.0560787Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:26.0561194Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:26.0561564Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:26.0561870Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:26.0562266Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:26.0562559Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:26.0562842Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:26.0563128Z #define PIPE_BUF 4096 2025-05-07T20:26:26.0563453Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:26.0563915Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:26.0564335Z #define ADJ_TICK 0x4000 2025-05-07T20:26:26.0564617Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:26.0564942Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:26.0565203Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:26.0565532Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:26.0566003Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.0566635Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:26.0567002Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:26.0567267Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:26.0567548Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0567829Z #define __cpp_static_assert 201411L 2025-05-07T20:26:26.0568118Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:26.0568388Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:26.0568659Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:26.0568945Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:26.0569252Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:26.0569529Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:26.0569835Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0570198Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:26.0570546Z #define 
__WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:26.0570830Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:26.0571152Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0571515Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:26.0571876Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:26.0572364Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:26.0572665Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:26.0572986Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:26.0573313Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:26.0573722Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:26.0574188Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:26.0574491Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:26.0574759Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:26.0575047Z #define __GCC_IEC_559 2 2025-05-07T20:26:26.0575338Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:26.0575683Z #define _IO_flockfile(_fp) 2025-05-07T20:26:26.0575950Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:26.0576216Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:26.0576478Z #define _IOFBF 0 2025-05-07T20:26:26.0576700Z #define __USE_BSD 1 2025-05-07T20:26:26.0576922Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:26.0577196Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:26.0577476Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:26.0577725Z #define _IO_NO_WRITES 8 2025-05-07T20:26:26.0577983Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:26.0578342Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:26.0578697Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:26.0578999Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:26.0579323Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:26.0579618Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:26.0579882Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:26.0580155Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:26.0580480Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:26.0580866Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:26.0581349Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:26.0581664Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:26.0581969Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:26.0582303Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:26.0582617Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:26.0582925Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:26.0583197Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:26.0583472Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:26.0584119Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:26.0584711Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:26.0585043Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:26.0585470Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:26.0585772Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:26.0586055Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:26.0586328Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:26.0586635Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:26.0586972Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:26.0587277Z #define RAND_MAX 2147483647 2025-05-07T20:26:26.0587547Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:26.0587873Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0588193Z #define __SM_90_RT_H__ 2025-05-07T20:26:26.0588440Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:26.0588698Z #define __COMPAR_FN_T 2025-05-07T20:26:26.0588946Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0589211Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:26.0589687Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:26.0590211Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.0590563Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.0590921Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:26.0591226Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:26.0591566Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:26.0591883Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:26.0592390Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:26.0592939Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:26.0593276Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:26.0593546Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:26.0593851Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:26.0594185Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:26.0594469Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:26.0594740Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:26.0595010Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:26.0595251Z #define __u_char_defined 2025-05-07T20:26:26.0595572Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:26.0595940Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:26.0596197Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:26.0596446Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:26.0596750Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:26.0597193Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:26.0597618Z #define FP_INFINITE 1 2025-05-07T20:26:26.0597993Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.0598418Z #define _IO_pid_t __pid_t 2025-05-07T20:26:26.0598672Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:26.0598934Z #define __LEAF , __leaf__ 2025-05-07T20:26:26.0599180Z #define PATH_MAX 4096 2025-05-07T20:26:26.0599429Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:26.0599767Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:26.0600099Z #define _LIMITS_H___ 2025-05-07T20:26:26.0600323Z #define __size_t 2025-05-07T20:26:26.0600561Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:26.0601203Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:26.0601783Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:26.0602091Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:26.0602424Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:26.0602690Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:26.0603043Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:26.0603447Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:26.0603744Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:26.0604065Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:26.0604355Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:26.0604638Z #define __INT8_C(c) c 2025-05-07T20:26:26.0604992Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:26.0605287Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:26.0605552Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:26.0605819Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:26.0606064Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:26.0606822Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:26.0607156Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0607482Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:26.0607756Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:26.0608033Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:26.0608293Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:26.0608613Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:26.0608919Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:26.0609280Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:26.0609665Z #define NFDBITS __NFDBITS 2025-05-07T20:26:26.0609927Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:26.0610223Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:26.0610542Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:26.0610865Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:26.0611132Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:26.0611417Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:26.0611723Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:26.0612128Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:26.0612545Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:26.0612913Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:26.0613201Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:26.0613515Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:26.0613844Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:26.0614167Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:26.0614506Z #define __daddr_t_defined 2025-05-07T20:26:26.0614761Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.0615040Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:26.0615361Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:26.0615877Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:26.0616372Z #define _ACRTIMP 2025-05-07T20:26:26.0616598Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:26.0616860Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:26.0617154Z #define _IOS_BIN 128 2025-05-07T20:26:26.0617513Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:26.0617930Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0618202Z #define UNDERFLOW 4 2025-05-07T20:26:26.0618422Z #define NAME_MAX 255 
2025-05-07T20:26:26.0618662Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:26.0618930Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:26.0619212Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:26.0619512Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:26.0619888Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:26.0620282Z #define __ptr_t void * 2025-05-07T20:26:26.0620809Z #define M_E 2.7182818284590452354 2025-05-07T20:26:26.0621086Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:26.0621355Z #define __USE_ISOCXX11 1 2025-05-07T20:26:26.0621624Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:26.0621939Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:26.0622237Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:26.0622517Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:26.0622803Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:26.0623124Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:26.0623388Z #define __linux 1 2025-05-07T20:26:26.0623619Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:26.0623888Z #define cudaDeviceMask 0xff 2025-05-07T20:26:26.0624158Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:26.0624594Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:26.0624870Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:26.0625159Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:26.0625476Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:26.0625781Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:26.0626079Z #define _BITS_TYPES_H 1 2025-05-07T20:26:26.0626368Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:26.0626708Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:26.0627014Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:26.0627298Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:26.0627592Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:26.0627878Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:26.0628686Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:26.0629531Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:26.0629813Z #define __unix 1 2025-05-07T20:26:26.0630034Z #define MATH_ERRNO 1 2025-05-07T20:26:26.0630281Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:26.0630565Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:26.0630830Z #define __SM_100_RT_H__ 2025-05-07T20:26:26.0631084Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:26.0631366Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:26.0631659Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0631939Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:26.0632243Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:26.0632708Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:26.0633182Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:26.0633487Z #define CUDARTAPI_CDECL 2025-05-07T20:26:26.0633740Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:26.0634020Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:26.0634311Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:26.0634574Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:26.0634812Z #define __SIZE_T 2025-05-07T20:26:26.0635060Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:26.0635382Z #define 
_GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:26.0635678Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:26.0635940Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:26.0636212Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:26.0636469Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:26.0636863Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:26.0637306Z #define __WAIT_STATUS void * 2025-05-07T20:26:26.0637566Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:26.0637835Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:26.0638107Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:26.0638394Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:26.0638673Z #define __WINT_MIN__ 0U 2025-05-07T20:26:26.0639275Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:26.0640029Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:26.0640340Z #define WUNTRACED 2 2025-05-07T20:26:26.0640575Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:26.0640854Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:26.0641137Z #define NZERO 20 2025-05-07T20:26:26.0641368Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:26.0641650Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:26.0641941Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:26.0642236Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:26.0642496Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.0642779Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:26.0643059Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:26.0643341Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:26.0643618Z #define EXIT_FAILURE 1 2025-05-07T20:26:26.0644003Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:26.0644269Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:26.0644533Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:26.0644791Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:26.0645080Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:26.0645428Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:26.0645788Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:26.0646087Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:26.0646343Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:26.0646612Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:26.0646915Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:26.0647230Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:26.0647521Z #define SEEK_DATA 3 2025-05-07T20:26:26.0647755Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:26.0648059Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:26.0648479Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:26.0649672Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 
2025-05-07T20:26:26.0650391Z 2025-05-07T20:26:26.0650488Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:26.0650750Z #define __INT64_C(c) c ## L 2025-05-07T20:26:26.0651017Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:26.0651358Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:26.0651687Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:26.0652069Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:26.0652401Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:26.0652705Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:26.0652959Z #define __INT_WCHAR_T_H 2025-05-07T20:26:26.0653193Z #define WSTOPPED 2 2025-05-07T20:26:26.0653433Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:26.0653722Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:26.0654002Z #define FP_NORMAL 4 2025-05-07T20:26:26.0654269Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:26.0654557Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:26.0654789Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:26.0655051Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:26.0655336Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:26.0655604Z #define cudaTextureType1D 0x01 2025-05-07T20:26:26.0655879Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:26.0656147Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:26.0656414Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:26.0656712Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:26.0657148Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:26.0657604Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:26.0657864Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:26.0658144Z #define _POSIX_SOURCE 1 2025-05-07T20:26:26.0658402Z #define cudaTextureType2D 0x02 2025-05-07T20:26:26.0658668Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:26.0658943Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:26.0659263Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:26.0659528Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:26.0659955Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:26.0660302Z #define cudaTextureType3D 0x03 2025-05-07T20:26:26.0660573Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:26.0660836Z #define CLOCK_REALTIME 0 2025-05-07T20:26:26.0661089Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:26.0661361Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:26.0661669Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:26.0661952Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:26.0662235Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:26.0675906Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:26.0676256Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:26.0676566Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:26.0677034Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:26.0677305Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:26.0677552Z #define __GLIBC__ 2 2025-05-07T20:26:26.0677759Z #define __END_DECLS } 2025-05-07T20:26:26.0677994Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:26.0678354Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:26.0678721Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:26.0678965Z #define WCONTINUED 8 2025-05-07T20:26:26.0679187Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:26.0679426Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:26.0679692Z #define _ALLOCA_H 1 2025-05-07T20:26:26.0679907Z #define __host__ __location__(host) 
2025-05-07T20:26:26.0680326Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.0680760Z #define __SLONG32_TYPE int 2025-05-07T20:26:26.0681027Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:26.0681310Z #define _SYS_SELECT_H 1 2025-05-07T20:26:26.0681540Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:26.0681780Z #define _IOS_NOCREATE 32 2025-05-07T20:26:26.0682018Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:26.0682284Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:26.0682578Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:26.0682859Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:26.0683137Z #define __global__ __location__(global) 2025-05-07T20:26:26.0683425Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:26.0683686Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:26.0683978Z #define __DBL_DIG__ 15 2025-05-07T20:26:26.0684243Z #define TIME_UTC 1 2025-05-07T20:26:26.0684466Z #define __FLT32_DIG__ 6 2025-05-07T20:26:26.0684802Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:26.0685202Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:26.0685529Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:26.0685853Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:26.0686165Z #define _G_BUFSIZ 8192 2025-05-07T20:26:26.0686476Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:26.0686857Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:26.0687158Z #define __cudaCDP2GetDevice 2025-05-07T20:26:26.0687453Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:26.0687749Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:26.0687986Z #define __GXX_WEAK__ 1 2025-05-07T20:26:26.0688241Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0688547Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:26.0688802Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:26.0689102Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:26.0689445Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:26.0689725Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:26.0690006Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:26.0690307Z #define _G_config_h 1 2025-05-07T20:26:26.0690593Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:26.0690932Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:26.0691215Z #define _GCC_WCHAR_T 2025-05-07T20:26:26.0691450Z #define TMP_MAX 238328 2025-05-07T20:26:26.0691685Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:26.0692252Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:26.0692520Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.0692794Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:26.0693074Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:26.0693367Z #define _IO_SKIPWS 01 2025-05-07T20:26:26.0693770Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:26.0694277Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:26.0694564Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:26.0694906Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:26.0695273Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:26.0695648Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:26.0696103Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.0696353Z #define le32toh(x) (x) 2025-05-07T20:26:26.0696599Z #define _SIZE_T_DEFINED 2025-05-07T20:26:26.0696851Z 
#define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:26.0697195Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:26.0697551Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:26.0697956Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:26.0698380Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:26.0698642Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:26.0698909Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:26.0699178Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:26.0699456Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:26.0699995Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:26.0700505Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:26.0700813Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:26.0701174Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:26.0701491Z #define _WCHAR_T_ 2025-05-07T20:26:26.0701712Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:26.0702086Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:26.0702479Z #define RTSIG_MAX 32 2025-05-07T20:26:26.0702702Z #define _STDDEF_H 2025-05-07T20:26:26.0702928Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:26.0703203Z #define _VA_LIST_DEFINED 2025-05-07T20:26:26.0703461Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:26.0703794Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:26.0704240Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:26.0704575Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:26.0704862Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:26.0705332Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:26.0705879Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:26.0706552Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:26.0706982Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:26.0707409Z #define __unix__ 1 2025-05-07T20:26:26.0707726Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0708015Z #define __INT_WIDTH__ 32 2025-05-07T20:26:26.0708264Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:26.0708503Z #define _IONBF 2 2025-05-07T20:26:26.0708948Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:26.0709726Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:26.0710275Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:26.0710536Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:26.0710802Z #define __UINT16_C(c) c 2025-05-07T20:26:26.0711047Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:26.0711332Z #define STA_DEL 0x0020 2025-05-07T20:26:26.0711571Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:26.0711830Z #define __id_t_defined 2025-05-07T20:26:26.0712381Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:26.0712837Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:26.0713273Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:26.0713546Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:26.0713801Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:26.0714094Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:26.0714378Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:26.0714652Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:26.0714914Z #define SING 2 2025-05-07T20:26:26.0715134Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:26.0715409Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0715708Z #define cudaStreamDefault 0x00 2025-05-07T20:26:26.0716062Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:26.0716628Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:26.0716895Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:26.0717165Z #define __gnu_linux__ 1 2025-05-07T20:26:26.0717412Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:26.0717668Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:26.0717967Z #define MAX_INPUT 255 2025-05-07T20:26:26.0718213Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:26.0718539Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:26.0718913Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:26.0719234Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:26.0719502Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:26.0719895Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:26.0720325Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:26.0720656Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:26.0721012Z #define _Mfloat_ float 2025-05-07T20:26:26.0721288Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:26.0721606Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.0721892Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:26.0722228Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:26.0722775Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:26.0723275Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0723547Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:26.0723878Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:26.0724279Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:26.0724584Z #define __USE_ISOC11 1 2025-05-07T20:26:26.0724822Z #define _BSD_SIZE_T_ 2025-05-07T20:26:26.0725061Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:26.0725307Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:26.0725574Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:26.0725882Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:26.0726198Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:26.0726514Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:26.0726853Z #define __THROW throw () 2025-05-07T20:26:26.0727111Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:26.0727400Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0727760Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.0728120Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:26.0728399Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:26.0728665Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:26.0728936Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:26.0729194Z #define L_tmpnam 20 2025-05-07T20:26:26.0729428Z #define ___int_wchar_t_h 2025-05-07T20:26:26.0729773Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:26.0730153Z #define isascii(c) __isascii (c) 2025-05-07T20:26:26.0730415Z #define _T_PTRDIFF 2025-05-07T20:26:26.0730730Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:26.0731082Z #define toascii(c) __toascii (c) 2025-05-07T20:26:26.0731344Z #define __GNUC__ 11 2025-05-07T20:26:26.0731695Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:26.0732135Z #define __GXX_RTTI 1 2025-05-07T20:26:26.0732385Z #define __pie__ 2 2025-05-07T20:26:26.0732598Z #define __MMX__ 1 2025-05-07T20:26:26.0732824Z #define __cudaCDP2Malloc 2025-05-07T20:26:26.0733076Z #define __timespec_defined 1 2025-05-07T20:26:26.0733328Z #define L_ctermid 9 2025-05-07T20:26:26.0733564Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:26.0733864Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:26.0734262Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:26.0734640Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:26.0734902Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:26.0735195Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:26.0735603Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:26.0735920Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:26.0736204Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:26.0736653Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:26.0737411Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:26.0738020Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:26.0738327Z #define __USE_SVID 1 2025-05-07T20:26:26.0738584Z #define __constant__ __location__(constant) 2025-05-07T20:26:26.0738895Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:26.0739197Z #define __device__ __location__(device) 2025-05-07T20:26:26.0739531Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:26.0739856Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:26.0740127Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:26.0740420Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:26.0740767Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:26.0741142Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:26.0741437Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:26.0741811Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:26.0742191Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:26.0742444Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:26.0742814Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:26.0743237Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:26.0743555Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:26.0743851Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:26.0744145Z #define NGROUPS_MAX 65536 2025-05-07T20:26:26.0744402Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:26.0744669Z #define __USE_ISOC95 1 2025-05-07T20:26:26.0744891Z #define _TIME_H 1 2025-05-07T20:26:26.0745160Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:26.0745491Z #define __USE_ISOC99 1 2025-05-07T20:26:26.0745821Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:26.0746191Z #define HOST_NAME_MAX 64 2025-05-07T20:26:26.0746448Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:26.0746713Z #define _IOS_ATEND 4 2025-05-07T20:26:26.0746950Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:26.0747280Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.0747689Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.0748033Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:26.0748319Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:26.0748646Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:26.0748959Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:26.0749218Z #define _STDIO_H 1 2025-05-07T20:26:26.0749619Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:26.0750100Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:26.0750463Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:26.0750932Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:26.0751232Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:26.0751497Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:26.0751773Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:26.0752069Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:26.0752370Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0752691Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:26.0752969Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.0753247Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:26.0753557Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:26.0753837Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:26.0754154Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:26.0754530Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:26.0754990Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:26.0755239Z #define __USE_XOPEN 1 2025-05-07T20:26:26.0755480Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:26.0755935Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:26.0756383Z #define __USE_XOPEN2K 1 2025-05-07T20:26:26.0756624Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:26.0756897Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:26.0757198Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:26.0757469Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:26.0757996Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.0758528Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.0758812Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:26.0759175Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:26.0759566Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:26.0759955Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:26.0760349Z #define __END_NAMESPACE_C99 2025-05-07T20:26:26.0760626Z #define __glibcxx_integral_traps true 2025-05-07T20:26:26.0760923Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:26.0761177Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:26.0761436Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:26.0761703Z #define _IOS_TRUNC 16 2025-05-07T20:26:26.0761932Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:26.0762184Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:26.0762480Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:26.0762777Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:26.0763147Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:26.0763544Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:26.0763824Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:26.0764086Z #define _IO_UNITBUF 020000 2025-05-07T20:26:26.0764340Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:26.0764607Z #define __FD_SETSIZE 1024 2025-05-07T20:26:26.0764859Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:26.0765133Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:26.0765480Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:26.0765834Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:26.0766105Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:26.0766418Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:26.0766736Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:26.0767013Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:26.0767326Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:26.0767671Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:26.0767957Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:26.0768291Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:26.0768585Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:26.0768854Z #define __USE_POSIX199506 1 2025-05-07T20:26:26.0769109Z #define _FEATURES_H 1 2025-05-07T20:26:26.0769356Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:26.0769749Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:26.0770333Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:26.0770672Z #define 
__stub_getmsg 2025-05-07T20:26:26.0770901Z #define _IO_FIXED 010000 2025-05-07T20:26:26.0771177Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:26.0771498Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:26.0771777Z #define __stub_setlogin 2025-05-07T20:26:26.0772170Z #define __stub_fattach 2025-05-07T20:26:26.0772420Z #define __cplusplus 201703L 2025-05-07T20:26:26.0772693Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:26.0772973Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:26.0773234Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:26.0773516Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:26.0774049Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:26.0774685Z #define _IO_INTERNAL 010 2025-05-07T20:26:26.0774934Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:26.0775268Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.0775634Z #define __dev_t_defined 2025-05-07T20:26:26.0775876Z #define __DEPRECATED 1 2025-05-07T20:26:26.0776103Z #define __S32_TYPE int 2025-05-07T20:26:26.0776360Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:26.0776661Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:26.0776922Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:26.0777176Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:26.0777787Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:26.0778431Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:26.0778741Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:26.0779090Z #define OVERFLOW 3 2025-05-07T20:26:26.0779337Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:26.0779650Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:26.0779936Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0780279Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:26.0780613Z #define __SSE2_MATH__ 1 2025-05-07T20:26:26.0780863Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:26.0781178Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0781486Z #define _IO_STDIO_H 2025-05-07T20:26:26.0781731Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:26.0782026Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:26.0782350Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:26.0782649Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0782967Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:26.0783237Z #define __amd64 1 2025-05-07T20:26:26.0783460Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:26.0783732Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:26.0784017Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:26.0784358Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:26.0784670Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:26.0784941Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:26.0785245Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:26.0785518Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:26.0785778Z #define __bounded 2025-05-07T20:26:26.0786009Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:26.0786280Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0786576Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:26.0786866Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:26.0787131Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:26.0787411Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0787738Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:26.0788154Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:26.0788565Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:26.0788840Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:26.0789183Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:26.0789535Z #define STA_PLL 0x0001 2025-05-07T20:26:26.0789784Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:26.0790150Z #define __GNUG__ 11 2025-05-07T20:26:26.0790391Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:26.0790660Z #define _T_WCHAR 2025-05-07T20:26:26.0790901Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:26.0791188Z #define __specialization_static 2025-05-07T20:26:26.0791497Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:26.0791819Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:26.0792079Z #define cudaArraySparse 0x40 2025-05-07T20:26:26.0792354Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:26.0792640Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:26.0792940Z #define _WCHAR_T 2025-05-07T20:26:26.0793168Z #define __cudaCDP2Free 2025-05-07T20:26:26.0793824Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:26.0794684Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:26.0795113Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:26.0795561Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:26.0795844Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:26.0796108Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:26.0796449Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.0796806Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:26.0797048Z #define __NO_CTYPE 1 2025-05-07T20:26:26.0797279Z #define __stub_bdflush 2025-05-07T20:26:26.0797647Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:26.0806041Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:26.0806732Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:26.0807042Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:26.0807323Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:26.0807638Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:26.0807939Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:26.0808297Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:26.0808643Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:26.0808934Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:26.0809212Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:26.0809564Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:26.0809919Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:26.0810206Z #define _IO_STDIO 040000 2025-05-07T20:26:26.0810538Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:26.0810933Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:26.0811258Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:26.0811549Z #define _PTRDIFF_T 2025-05-07T20:26:26.0811778Z #define _MOVE_H 1 2025-05-07T20:26:26.0812106Z #define __cpp_hex_float 201603L 2025-05-07T20:26:26.0812368Z #define ADJ_TAI 0x0080 2025-05-07T20:26:26.0812603Z #define __ptrvalue 2025-05-07T20:26:26.0812836Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:26.0813090Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:26.0813382Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:26.0813692Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:26.0813944Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:26.0814240Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:26.0814646Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:26.0815027Z #define __USE_GNU 1 2025-05-07T20:26:26.0815262Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:26.0815542Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:26.0815816Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:26.0816201Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:26.0816598Z #define WEXITED 4 2025-05-07T20:26:26.0816815Z #define _IO_NO_READS 4 2025-05-07T20:26:26.0817112Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:26.0817464Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:26.0818060Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:26.0818361Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:26.0818681Z #define __uid_t_defined 2025-05-07T20:26:26.0818934Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:26.0819216Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:26.0819491Z #define WNOHANG 1 2025-05-07T20:26:26.0819740Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:26.0820048Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:26.0820326Z #define cudaEventDefault 0x00 2025-05-07T20:26:26.0820632Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:26.0820958Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:26.0821194Z #define __x86_64 1 2025-05-07T20:26:26.0821433Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:26.0821991Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:26.0822472Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:26.0822987Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.0823429Z #define __PTRDIFF_T 2025-05-07T20:26:26.0823753Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:26.0824141Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:26.0824427Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0824723Z #define _Mlong_double_ long double 2025-05-07T20:26:26.0825003Z #define __cpp_lambdas 200907L 2025-05-07T20:26:26.0825266Z #define _IO_DEC 020 2025-05-07T20:26:26.0825503Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:26.0825773Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:26.0826069Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:26.0826359Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:26.0826626Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:26.0826931Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:26.0827271Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:26.0827549Z #define _ANSI_STDDEF_H 2025-05-07T20:26:26.0827837Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:26.0828159Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:26.0828528Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:26.0828920Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:26.0829210Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:26.0829509Z #define __cpp_template_auto 201606L 2025-05-07T20:26:26.0829866Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:26.0830243Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:26.0830519Z #define __key_t_defined 2025-05-07T20:26:26.0830768Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:26.0831145Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:26.0831631Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:26.0831999Z #define __GNUC_VA_LIST 2025-05-07T20:26:26.0832347Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:26.0832740Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:26.0833013Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:26.0833294Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:26.0833594Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:26.0833850Z #define __WCOREFLAG 0x80 2025-05-07T20:26:26.0834103Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:26.0834413Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:26.0834696Z #define __LP64__ 1 2025-05-07T20:26:26.0834943Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:26.0835270Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:26.0835559Z #define _IO_off64_t __off64_t 2025-05-07T20:26:26.0835822Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0836093Z #define __time_t_defined 1 2025-05-07T20:26:26.0836354Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:26.0836800Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:26.0837177Z #define __USE_UNIX98 1 2025-05-07T20:26:26.0837425Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0837702Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:26.0837971Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:26.0838273Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:26.0838591Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:26.0838847Z #define SEEK_CUR 1 2025-05-07T20:26:26.0839086Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.0839361Z #define _ASSERT_H 1 2025-05-07T20:26:26.0839939Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:26.0840586Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:26.0840957Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:26.0841217Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:26.0841480Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:26.0841756Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:26.0842145Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.0842558Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:26.0843234Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:26.0843926Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:26.0844259Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:26.0844611Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:26.0844994Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:26.0845269Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.0845552Z #define cudaArrayDefault 0x00 2025-05-07T20:26:26.0845843Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:26.0846138Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:26.0846418Z #define TLOSS 5 2025-05-07T20:26:26.0846648Z #define __ssize_t_defined 2025-05-07T20:26:26.0846905Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:26.0847177Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:26.0847475Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:26.0847763Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:26.0848044Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:26.0848336Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:26.0848650Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:26.0848946Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:26.0849235Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:26.0849529Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:26.0849792Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:26.0850126Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:26.0850497Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:26.0850742Z #define __cdecl 2025-05-07T20:26:26.0850979Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:26.0851321Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:26.0851654Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:26.0851909Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:26.0852318Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:26.0852617Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:26.0852884Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.0853202Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:26.0853543Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:26.0853963Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:26.0854453Z #define ADJ_NANO 0x2000 2025-05-07T20:26:26.0854767Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:26.0855133Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:26.0855426Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:26.0855698Z #define __FLT_DIG__ 6 2025-05-07T20:26:26.0856147Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:26.0856553Z #define __NO_INLINE__ 1 2025-05-07T20:26:26.0856865Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.0857223Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:26.0857490Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:26.0857754Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:26.0858050Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:26.0858329Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:26.0858628Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:26.0858928Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:26.0859317Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:26.0859738Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:26.0860092Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:26.0860535Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:26.0860778Z #define MAX_CANON 255 2025-05-07T20:26:26.0861012Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:26.0861275Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:26.0861543Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:26.0861835Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:26.0862146Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:26.0862451Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:26.0862728Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:26.0863058Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:26.0863374Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:26.0863633Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:26.0863934Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:26.0864230Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:26.0864508Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:26.0864836Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:26.0865137Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:26.0865396Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:26.0865654Z #define _SYS_TYPES_H 1 2025-05-07T20:26:26.0865905Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:26.0866166Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:26.0866441Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:26.0866684Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:26.0866961Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:26.0867259Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:26.0867519Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:26.0867820Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:26.0868090Z #define FP_SUBNORMAL 3 2025-05-07T20:26:26.0868347Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:26.0868637Z #define _INITIALIZER_LIST 2025-05-07T20:26:26.0868887Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:26.0869154Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:26.0869456Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:26.0869717Z #define _IO_file_flags _flags 2025-05-07T20:26:26.0869982Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:26.0870239Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:26.0870521Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:26.0870802Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:26.0871074Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:26.0871448Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:26.0871849Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:26.0872167Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:26.0872441Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:26.0872700Z #define _BSD_SOURCE 1 2025-05-07T20:26:26.0872939Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:26.0873795Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:26.0874669Z #define __catch(X) catch(X) 2025-05-07T20:26:26.0874934Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:26.0875229Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.0875640Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:26.0875902Z #define __STRING(x) #x 2025-05-07T20:26:26.0876149Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:26.0876427Z #define _T_PTRDIFF_ 2025-05-07T20:26:26.0876669Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:26.0876979Z 
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:26.0877260Z #define __unbounded 2025-05-07T20:26:26.0877501Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.0877794Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:26.0878077Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0878378Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:26.0878660Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:26.0878961Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:26.0879377Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:26.0879691Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:26.0879980Z #define __managed__ __location__(managed) 2025-05-07T20:26:26.0880282Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:26.0880686Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.0881116Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:26.0881381Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:26.0881758Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:26.0882171Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:26.0882429Z #define _SYS_SIZE_T_H 2025-05-07T20:26:26.0882718Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:26.0883062Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:26.0883347Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:26.0883639Z #define _CRTIMP 2025-05-07T20:26:26.0883867Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:26.0884189Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:26.0884515Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:26.0884882Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:26.0885305Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0885630Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:26.0885909Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:26.0886202Z #define __SIZE_T__ 2025-05-07T20:26:26.0886422Z #define __stub_gtty 2025-05-07T20:26:26.0886649Z #define __pid_t_defined 2025-05-07T20:26:26.0886916Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:26.0887226Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.0887538Z #define __glibcxx_function_requires(...) 
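The dump above records glibc's <endian.h> conversion macros for this little-endian target: htobe64 expands to __bswap_64, while htole32/htole16/le64toh expand to the identity. A minimal C sketch of the behaviour those expansions imply, assuming glibc on a little-endian x86-64 host (the example value 0x11223344 is illustrative only):

    #include <endian.h>   /* htobe32/htole32 etc., as captured in the dump above */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t x = 0x11223344u;
        /* htole32 is a no-op here, per "#define htole32(x) (x)" in the dump,  */
        /* while the big-endian conversions reverse all four bytes via bswap. */
        printf("htole32(0x11223344) = 0x%08x\n", (unsigned) htole32(x)); /* 0x11223344 */
        printf("htobe32(0x11223344) = 0x%08x\n", (unsigned) htobe32(x)); /* 0x44332211 */
        return 0;
    }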
2025-05-07T20:26:26.0887884Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:26.0888205Z #define __need_clockid_t 2025-05-07T20:26:26.0888448Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:26.0888717Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:26.0889041Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:26.0889364Z #define _IO_HEX 0100 2025-05-07T20:26:26.0889623Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:26.0889964Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:26.0890063Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:26.0890177Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:26.0890402Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.0890520Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:26.0890633Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:26.0890733Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:26.0890839Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:26.0890949Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:26.0891033Z #define __stub_sstk 2025-05-07T20:26:26.0891126Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:26.0891289Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:26.0891377Z #define __wur 2025-05-07T20:26:26.0891500Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:26.0891587Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:26.0891766Z #define _IO_OCT 040 2025-05-07T20:26:26.0891866Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:26.0892068Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:26.0892160Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:26.0892295Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:26.0892387Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:26.0892489Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:26.0892686Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:26.0892781Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:26.0892878Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:26.0892987Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:26.0893075Z #define __off64_t_defined 2025-05-07T20:26:26.0893184Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:26.0893360Z #define __FLT128_DIG__ 33 2025-05-07T20:26:26.0893466Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:26.0893569Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:26.0893660Z #define __INT32_C(c) c 2025-05-07T20:26:26.0893755Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:26.0893873Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:26.0893979Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:26.0894089Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:26.0894185Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:26.0894280Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:26.0894419Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:26.0894513Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:26.0894602Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:26.0894709Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:26.0894805Z #define __have_pthread_attr_t 1 2025-05-07T20:26:26.0894904Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:26.0895142Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:26.0895250Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:26.0895353Z #define __cudaCDP2EventRecord 2025-05-07T20:26:26.0895460Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:26.0895545Z #define 
htole32(x) (x) 2025-05-07T20:26:26.0895800Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:26.0895930Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:26.0896030Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:26.0896193Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.0896333Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:26.0896458Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:26.0896605Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:26.0896695Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:26.0896797Z #define cudaArrayLayered 0x01 2025-05-07T20:26:26.0896981Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:26.0897091Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:26.0897185Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:26.0897298Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:26.0897381Z #define unix 1 2025-05-07T20:26:26.0897484Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:26.0897576Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:26.0897670Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:26.0897793Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:26.0897879Z #define __USE_POSIX 1 2025-05-07T20:26:26.0897974Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:26.0898112Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:26.0898204Z #define __THROWNL throw () 2025-05-07T20:26:26.0898295Z #define __cpp_rtti 199711L 2025-05-07T20:26:26.0898404Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:26.0898492Z #define __PMT(args) args 2025-05-07T20:26:26.0898644Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0898848Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:26.0898965Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:26.0899061Z #define _SIZE_T_DECLARED 2025-05-07T20:26:26.0899250Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:26.0899343Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:26.0899753Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:26.0899851Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:26.0899944Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:26.0900044Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:26.0900186Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:26.0900270Z #define _WCHAR_T_H 2025-05-07T20:26:26.0900364Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:26.0900453Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:26.0900547Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:26.0900733Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:26.0900829Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:26.0900921Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:26.0901029Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:26.0901115Z #define __ELF__ 1 2025-05-07T20:26:26.0901220Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:26.0901318Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:26.0901408Z #define STA_INS 0x0010 2025-05-07T20:26:26.0901521Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:26.0901693Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:26.0901787Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:26.0901887Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.0901998Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
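The wait-status helpers captured in this stretch of the dump (__WEXITSTATUS, __WTERMSIG, __W_STOPCODE, __WIFEXITED, WSTOPSIG) imply a bit layout: an exit code lives in bits 8-15, a terminating signal in the low 7 bits, and a stop is encoded as (sig << 8) | 0x7f. A minimal C sketch of that layout, assuming glibc's <sys/wait.h>; the status values are hand-built for illustration, not taken from a real child process:

    #include <stdio.h>
    #include <sys/wait.h>

    int main(void) {
        int exited  = 42 << 8;           /* as if a child exited with code 42     */
        int stopped = (19 << 8) | 0x7f;  /* __W_STOPCODE(19): stopped by SIGSTOP  */
        /* __WIFEXITED(status) is __WTERMSIG(status) == 0, so exit codes pass. */
        printf("%d %d\n", WIFEXITED(exited),   WEXITSTATUS(exited)); /* 1 42 */
        printf("%d %d\n", WIFSTOPPED(stopped), WSTOPSIG(stopped));   /* 1 19 */
        return 0;
    }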
2025-05-07T20:26:26.0902112Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0902209Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:26.0902313Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:26.0902416Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:26.0902578Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.0902739Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:26.0902849Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:26.0903179Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:26.0903317Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:26.0903411Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:26.0903497Z #define __FLT_RADIX__ 2 2025-05-07T20:26:26.0903603Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:26.0903772Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:26.0903866Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:26.0903965Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:26.0904086Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:26.0904190Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:26.0904313Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:26.0904420Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:26.0904503Z #define WORD_BIT 32 2025-05-07T20:26:26.0904596Z #define _IO_USER_BUF 1 2025-05-07T20:26:26.0904692Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:26.0904803Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0904912Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:26.0905010Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:26.0905117Z #define __long_double_t long double 2025-05-07T20:26:26.0905212Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:26.0905304Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:26.0905717Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:26.0905801Z #define __k8 1 2025-05-07T20:26:26.0905998Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:26.0906563Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:26.0906703Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:26.0906811Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:26.0906910Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:26.0907248Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:26.0907353Z #define __blksize_t_defined 2025-05-07T20:26:26.0907446Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:26.0907545Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:26.0907669Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:26.0907762Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:26.0907869Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:26.0907972Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:26.0908066Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:26.0908324Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:26.0908681Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.0908782Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:26.0909024Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:26.0909108Z #define SEEK_SET 0 2025-05-07T20:26:26.0909206Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:26.0909316Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:26.0909512Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:26.0909615Z #define __cudaCDP2GetLastError 2025-05-07T20:26:26.0909716Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:26.0909806Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:26.0910131Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:26.0910239Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:26.0910336Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:26.0910433Z #define __stub_sigreturn 2025-05-07T20:26:26.0910675Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:26.0910780Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:26.0910878Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:26.0910975Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:26.0911060Z #define CLOCK_TAI 11 2025-05-07T20:26:26.0911179Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:26.0911391Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:26.0911478Z #define __restrict_arr 2025-05-07T20:26:26.0911595Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:26.0911736Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:26.0912282Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:26.0912469Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:26.0912555Z #define __USE_MISC 1 2025-05-07T20:26:26.0912667Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:26.0912769Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:26.0912857Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:26.0912950Z #define __LDBL_DIG__ 18 2025-05-07T20:26:26.0913050Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:26.0913159Z #define __malloc_and_calloc_defined 2025-05-07T20:26:26.0913250Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:26.0913351Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:26.0913439Z #define __x86_64__ 1 2025-05-07T20:26:26.0913520Z #define _SIZE_T_ 2025-05-07T20:26:26.0914430Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:26.0914538Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:26.0914636Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:26.0914758Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:26.0914881Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:26.0915107Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:26.0915224Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:26.0915346Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:26.0915486Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:26.0915589Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:26.0916066Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:26.0916198Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:26.0916347Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:26.0916449Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:26.0916551Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:26.0916719Z #define STA_FLL 0x0008 2025-05-07T20:26:26.0916862Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:26.0916965Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:26.0917092Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0917204Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:26.0917293Z #define __stub_revoke 2025-05-07T20:26:26.0917383Z #define __timer_t_defined 1 2025-05-07T20:26:26.0917516Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:26.0917612Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:26.0917717Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:26.0917828Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:26.0917922Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:26.0918024Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:26.0918138Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:26.0918238Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:26.0918393Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:26.0918493Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:26.0918582Z #define _IO_off_t __off_t 2025-05-07T20:26:26.0918668Z #define __FLT64_DIG__ 15 2025-05-07T20:26:26.0918903Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:26.0918999Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:26.0919134Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0919258Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:26.0919354Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:26.0919461Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:26.0919544Z #define NULL __null 2025-05-07T20:26:26.0919700Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:26.0919810Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:26.0919912Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:26.0920008Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0931729Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:26.0931869Z #define FP_ZERO 2 2025-05-07T20:26:26.0932085Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:26.0932248Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:26.0932364Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0932458Z #define __WCHAR_T__ 2025-05-07T20:26:26.0932556Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:26.0932763Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:26.0932914Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:26.0933012Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:26.0933139Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:26.0933254Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.0933384Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:26.0933517Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:26.0933608Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:26.0933702Z #define _SIGSET_H_types 1 2025-05-07T20:26:26.0933820Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:26.0933925Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:26.0934238Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:26.0934344Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:26.0934465Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:26.0934604Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:26.0934712Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:26.0934841Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:26.0934960Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:26.0935135Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:26.0935231Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:26.0935341Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:26.0935443Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:26.0935538Z #define STA_MODE 0x4000 2025-05-07T20:26:26.0935738Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:26.0935840Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:26.0935960Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:26.0936071Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:26.0936166Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:26.0936277Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:26.0936372Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:26.0936486Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:26.0936582Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:26.0936698Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.0936780Z #define __SEG_FS 1 2025-05-07T20:26:26.0936877Z #define _IO_size_t size_t 2025-05-07T20:26:26.0936974Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:26.0937077Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:26.0937163Z #define __stub_lchmod 2025-05-07T20:26:26.0937255Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:26.0937377Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0937473Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:26.0937556Z #define __SEG_GS 1 2025-05-07T20:26:26.0937753Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:26.0937842Z #define _IOS_APPEND 8 2025-05-07T20:26:26.0937936Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:26.0938036Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:26.0938133Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:26.0938230Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:26.0938336Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:26.0938422Z #define htole16(x) (x) 2025-05-07T20:26:26.0938537Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:26.0938631Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:26.0938725Z #define __INT16_TYPE__ short int 2025-05-07T20:26:26.0938833Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:26.0938941Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:26.0939056Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:26.0939190Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:26.0939281Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:26.0939376Z #define __WCLONE 0x80000000 2025-05-07T20:26:26.0939476Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:26.0939560Z #define SEEK_HOLE 4 2025-05-07T20:26:26.0939654Z #define TIMER_ABSTIME 1 2025-05-07T20:26:26.0939748Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:26.0939840Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:26.0940022Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:26.0940135Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0940230Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:26.0940344Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:26.0940441Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:26.0940562Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:26.0940663Z #define _LINUX_LIMITS_H 2025-05-07T20:26:26.0940749Z #define linux 1 2025-05-07T20:26:26.0940844Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:26.0940963Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:26.0941146Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:26.0941250Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:26.0941357Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:26.0941503Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:26.0941606Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:26.0941703Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.0941801Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:26.0941896Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:26.0941988Z #define htole64(x) (x) 2025-05-07T20:26:26.0942088Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:26.0942220Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:26.0942314Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:26.0942822Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:26.0942992Z #define __USE_POSIX2 1 2025-05-07T20:26:26.0943090Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:26.0943192Z #define __WALL 0x40000000 2025-05-07T20:26:26.0943288Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:26.0943373Z #define _XLOCALE_H 1 2025-05-07T20:26:26.0943473Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:26.0943568Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:26.0943662Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:26.0943771Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:26.0943859Z #define __EXCEPTIONS 1 2025-05-07T20:26:26.0943957Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:26.0944159Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:26.0944247Z #define __WORDSIZE 64 2025-05-07T20:26:26.0944345Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:26.0944434Z #define _STL_RELOPS_H 1 2025-05-07T20:26:26.0944531Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:26.0944638Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:26.0944733Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:26.0944824Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:26.0944934Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:26.0945238Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:26.0945473Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:26.0945612Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:26.0945709Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:26.0945819Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:26.0945931Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:26.0946030Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:26.0946143Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:26.0946326Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:26.0946429Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:26.0946528Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:26.0946631Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:26.0946814Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:26.0946936Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:26.0947020Z #define _STRING_H 1 2025-05-07T20:26:26.0947120Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:26.0947214Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:26.0947310Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:26.0947449Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:26.0947542Z #define __code_model_small__ 1 2025-05-07T20:26:26.0947630Z #define _PSTL_CONFIG_H 2025-05-07T20:26:26.0947736Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:26.0947848Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:26.0947943Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:26.0948051Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:26.0948404Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:26.0948498Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:26.0948706Z #define le64toh(x) (x) 2025-05-07T20:26:26.0948797Z #define FILENAME_MAX 4096 2025-05-07T20:26:26.0948955Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:26.0949069Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:26.0949153Z #define L_cuserid 9 2025-05-07T20:26:26.0949244Z #define __ino_t_defined 2025-05-07T20:26:26.0949324Z #define __k8__ 1 2025-05-07T20:26:26.0949420Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:26.0949534Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:26.0949626Z #define __int8_t_defined 2025-05-07T20:26:26.0949718Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:26.0949823Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:26.0949936Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:26.0950222Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:26.0950347Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:26.0950496Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:26.0950594Z #define __HAVE_COLUMN 2025-05-07T20:26:26.0950681Z #define __stub_fdetach 2025-05-07T20:26:26.0951102Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:26.0951190Z #define __pic__ 2 2025-05-07T20:26:26.0951309Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.0951404Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:26.0951502Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:26.0951602Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:26.0951688Z #define __stub_chflags 2025-05-07T20:26:26.0951781Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:26.0951865Z #define __need_IOV_MAX 2025-05-07T20:26:26.0951978Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:26.0952086Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:26.0952182Z #define __cpp_decltype 200707L 2025-05-07T20:26:26.0952285Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:26.0952380Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:26.0952487Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:26.0952578Z #define TTY_NAME_MAX 32 2025-05-07T20:26:26.0952747Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:26.0952869Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.0953043Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:26.0953152Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:26.0953250Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:26.0953343Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:26.0953424Z #define __import__ 2025-05-07T20:26:26.0953517Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:26.0953652Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:26.0953744Z #define __export__ 2025-05-07T20:26:26.0953889Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:26.0954007Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:26.0954181Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:26.0954283Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:26.0954372Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:26.0954466Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:26.0954563Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:26.0954682Z #define 
2025-05-07T20:26:26.0954807Z [predefined-macro dump truncated: several thousand #define lines covering glibc, libstdc++ (_GLIBCXX_*), PSTL (_PSTL_*), and the CUDA headers, including __NVCC__ 1, __CUDACC__ 1, __CUDA_ARCH_LIST__ 520, and CUDART_VERSION 12080]
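[NOTE] The dump above is standard preprocessor output; the exact command used by setup_env.bash is not shown in this excerpt. A minimal sketch that produces a comparable dump, assuming any GCC-compatible host compiler (CUDA-specific macros such as __CUDACC__ and __CUDA_ARCH_LIST__ appear only when the source is preprocessed through nvcc):

    # Print every macro the host C++ compiler predefines for an empty
    # translation unit; -dM emits #define lines instead of preprocessed code.
    echo | gcc -dM -E -x c++ - | sort | less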
2025-05-07T20:26:26.1215168Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:26.1215184Z 
2025-05-07T20:26:28.0072181Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:28.0072810Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:26:28.0073497Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:26:28.0074179Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:26:28.0074811Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:26:28.0075175Z 
2025-05-07T20:26:28.0721674Z /usr/bin/nvidia-smi
2025-05-07T20:26:28.0726716Z + nvidia-smi
2025-05-07T20:26:28.0726912Z 
2025-05-07T20:26:28.0904158Z Wed May  7 20:26:28 2025
2025-05-07T20:26:28.0904643Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.0905324Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:28.0905911Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.0907495Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:28.0908364Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:28.0908960Z |                                         |                        |               MIG M. |
2025-05-07T20:26:28.0909697Z |=========================================+========================+======================|
2025-05-07T20:26:28.1071580Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:28.1072156Z |  0%   28C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:26:28.1072588Z |                                         |                        |                  N/A |
2025-05-07T20:26:28.1073185Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.1075591Z 
2025-05-07T20:26:28.1082698Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.1083598Z | Processes:                                                                              |
2025-05-07T20:26:28.1084098Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:28.1084518Z |        ID   ID                                                               Usage      |
2025-05-07T20:26:28.1084864Z |=========================================================================================|
2025-05-07T20:26:28.1085308Z |  No running processes found                                                             |
2025-05-07T20:26:28.1085790Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.3819881Z 
2025-05-07T20:26:28.3824473Z [INSTALL] Successfully installed CUDA 12.8.0
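[NOTE] A minimal sketch for cross-checking the installed toolkit against the driver (not part of the workflow; the parsing patterns below are assumptions based on the output shown above):

    # Toolkit release reported by nvcc, e.g. "12.8".
    toolkit=$(conda run -n build_binary nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
    # Highest CUDA version the driver supports, from the nvidia-smi banner.
    driver=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
    echo "toolkit=${toolkit} driver_supports=${driver}"
    # The driver value is an upper bound: a 12.8 toolkit needs a driver
    # advertising CUDA Version >= 12.8, which is the case here.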
2025-05-07T20:26:28.3872026Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:28.3872792Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:28.3884619Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:28.3884966Z env:
2025-05-07T20:26:28.3885189Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:28.3885485Z   BUILD_ENV: build_binary
2025-05-07T20:26:28.3885728Z   BUILD_TARGET: genai
2025-05-07T20:26:28.3885958Z   BUILD_VARIANT: cuda
2025-05-07T20:26:28.3886183Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:26:28.3886437Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:28.3886738Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:28.3887068Z ##[endgroup]
2025-05-07T20:26:28.7270328Z ################################################################################
2025-05-07T20:26:28.7270701Z # Install PyTorch (PIP)
2025-05-07T20:26:28.7270935Z #
2025-05-07T20:26:28.7285997Z # [2025-05-07T20:26:28.728Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:26:28.7286720Z ################################################################################
2025-05-07T20:26:28.7287091Z 
2025-05-07T20:26:28.7314137Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.7094854Z Channels:
2025-05-07T20:26:29.7095144Z  - conda-forge
2025-05-07T20:26:29.7095407Z Platform: linux-64
2025-05-07T20:26:33.0904402Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.8019505Z Solving environment: done
2025-05-07T20:26:34.0198397Z 
2025-05-07T20:26:34.0198706Z ## Package Plan ##
2025-05-07T20:26:34.0198877Z 
2025-05-07T20:26:34.0199091Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:34.0199410Z 
2025-05-07T20:26:34.0199505Z   added / updated specs:
2025-05-07T20:26:34.0199759Z     - numpy
2025-05-07T20:26:34.0199877Z 
2025-05-07T20:26:34.0200040Z The following packages will be downloaded:
2025-05-07T20:26:34.0200257Z 
2025-05-07T20:26:34.0200372Z     package                    |            build
2025-05-07T20:26:34.0200712Z     ---------------------------|-----------------
2025-05-07T20:26:34.0201114Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:26:34.0201576Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:26:34.0202047Z     libgfortran-15.1.0         |      h69a702a_2              34 KB  conda-forge
2025-05-07T20:26:34.0202513Z     libgfortran5-15.1.0        |      hcea5267_2             1.5 MB  conda-forge
2025-05-07T20:26:34.0202984Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:26:34.0203471Z     libopenblas-0.3.29         |pthreads_h94d23a6_0          5.6 MB  conda-forge
2025-05-07T20:26:34.0203940Z     numpy-2.2.5                | py312h72c5963_0             8.1 MB  conda-forge
2025-05-07T20:26:34.0204342Z     ------------------------------------------------------------
2025-05-07T20:26:34.0204700Z                                            Total:        15.4 MB
2025-05-07T20:26:34.0205238Z 
2025-05-07T20:26:34.0205373Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:34.0205603Z 
2025-05-07T20:26:34.0205827Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:34.0206581Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:34.0207102Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:34.0207616Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:34.0208148Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:34.0208715Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:34.0209461Z   numpy              conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
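[NOTE] The download frames below come from conda's interactive progress renderer. A sketch of the same install with that output suppressed (assuming the exact command shown above; -q is conda's standard quiet flag):

    # Same install as above, but with progress bars suppressed, which avoids
    # the terminal-control frames that clutter CI logs like this one.
    conda install -n build_binary -c conda-forge --override-channels -y -q numpy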
2025-05-07T20:26:34.0209910Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:34.1224226Z libblas-3.9.0        | 16 KB   | ########## | 100%
2025-05-07T20:26:34.2466528Z libcblas-3.9.0       | 16 KB   | ########## | 100%
2025-05-07T20:26:34.3253985Z liblapack-3.9.0      | 16 KB   | ########## | 100%
2025-05-07T20:26:34.3712845Z libgfortran-15.1.0   | 34 KB   | ########## | 100%
2025-05-07T20:26:34.3762605Z libgfortran5-15.1.0  | 1.5 MB  | ########## | 100%
2025-05-07T20:26:34.6322673Z libopenblas-0.3.29   | 5.6 MB  | ########## | 100%
2025-05-07T20:26:35.0017412Z numpy-2.2.5          | 8.1 MB  | ########## | 100%
2025-05-07T20:26:35.0029022Z done
2025-05-07T20:26:35.1033139Z Preparing transaction: done
2025-05-07T20:26:35.2036120Z Verifying transaction: done
2025-05-07T20:26:35.3044859Z Executing transaction: done
2025-05-07T20:26:35.4845589Z ################################################################################
2025-05-07T20:26:35.4845994Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:35.4846287Z #
2025-05-07T20:26:35.4862941Z # [2025-05-07T20:26:35.485Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:26:35.4863441Z ################################################################################
2025-05-07T20:26:35.4864136Z 
2025-05-07T20:26:35.4878514Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.5801291Z [CHECK] Network does not appear to be blocked.
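[NOTE] The [EXEC] [ATTEMPT 0/3] prefix comes from a retry wrapper defined in .github/scripts/setup_env.bash; its implementation is not shown in this log. A minimal bash sketch of the same idea (the function name and retry policy are assumptions, not the real helper):

    # Hypothetical stand-in for the wrapper behind the [EXEC] [ATTEMPT n/3] lines.
    exec_with_retries() {
      local max=3 attempt
      for attempt in $(seq 0 $((max - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0    # stop on first success
        sleep 2             # back off before the next attempt
      done
      return 1              # all attempts failed
    }
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null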
2025-05-07T20:26:35.5801772Z ################################################################################
2025-05-07T20:26:35.5802186Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:35.5802518Z #
2025-05-07T20:26:35.5819584Z # [2025-05-07T20:26:35.581Z] + __prepare_pip_arguments torch nightly cuda/12.8.0
2025-05-07T20:26:35.5820131Z ################################################################################
2025-05-07T20:26:35.5820351Z 
2025-05-07T20:26:35.5841049Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:35.5869498Z [INSTALL] Extracted package variant: cu128
2025-05-07T20:26:35.5886454Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:35.5887088Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:26:35.5895419Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:35.5904384Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ...
2025-05-07T20:26:35.5926744Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:28:12.3390805Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:28:12.3391434Z Collecting torch
2025-05-07T20:28:12.3392508Z   Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:28:12.3393670Z Collecting filelock (from torch)
2025-05-07T20:28:12.3394407Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:28:12.3395920Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2)
2025-05-07T20:28:12.3397565Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1)
2025-05-07T20:28:12.3398591Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:28:12.3399335Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:28:12.3400558Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 51.0 MB/s eta 0:00:00
2025-05-07T20:28:12.3401072Z Collecting networkx (from torch)
2025-05-07T20:28:12.3401794Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:28:12.3402738Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.5 MB/s eta 0:00:00
2025-05-07T20:28:12.3403240Z Collecting jinja2 (from torch)
2025-05-07T20:28:12.3403995Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:28:12.3404752Z Collecting fsspec (from torch)
2025-05-07T20:28:12.3405473Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:28:12.3406624Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch)
2025-05-07T20:28:12.3408495Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3409733Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch)
2025-05-07T20:28:12.3411004Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3412339Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch)
2025-05-07T20:28:12.3413554Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3414964Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch)
2025-05-07T20:28:12.3416000Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB)
2025-05-07T20:28:12.3417097Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch)
2025-05-07T20:28:12.3418185Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3419261Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch)
2025-05-07T20:28:12.3420464Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:12.3421663Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch)
2025-05-07T20:28:12.3422743Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:12.3423850Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch)
2025-05-07T20:28:12.3424966Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3426164Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch)
2025-05-07T20:28:12.3427398Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3428638Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:28:12.3429737Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
2025-05-07T20:28:12.3430832Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:28:12.3432007Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:28:12.3433214Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch)
2025-05-07T20:28:12.3434410Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3435648Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch)
2025-05-07T20:28:12.3436892Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:28:12.3438123Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch)
2025-05-07T20:28:12.3439333Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:12.3440584Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:28:12.3441848Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:12.3443100Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:28:12.3444150Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:28:12.3445245Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 4.9 MB/s eta 0:00:00
2025-05-07T20:28:12.3445790Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:28:12.3446856Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
2025-05-07T20:28:12.3448497Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl (1047.0 MB)
2025-05-07T20:28:12.3449722Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 21.0 MB/s eta 0:00:00
2025-05-07T20:28:12.3450938Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB)
2025-05-07T20:28:12.3452248Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 53.0 MB/s eta 0:00:00
2025-05-07T20:28:12.3453460Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB)
2025-05-07T20:28:12.3454774Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 169.6 MB/s eta 0:00:00
2025-05-07T20:28:12.3455985Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB)
2025-05-07T20:28:12.3457303Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 144.8 MB/s eta 0:00:00
2025-05-07T20:28:12.3458531Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB)
2025-05-07T20:28:12.3459864Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 93.5 MB/s eta 0:00:00
2025-05-07T20:28:12.3460890Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB)
2025-05-07T20:28:12.3462080Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 44.4 MB/s eta 0:00:00
2025-05-07T20:28:12.3463228Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB)
2025-05-07T20:28:12.3464485Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 105.8 MB/s eta 0:00:00
2025-05-07T20:28:12.3465675Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB)
2025-05-07T20:28:12.3466933Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 100.6 MB/s eta 0:00:00
2025-05-07T20:28:12.3467975Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB)
2025-05-07T20:28:12.3469152Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 146.4 MB/s eta 0:00:00
2025-05-07T20:28:12.3470219Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB)
2025-05-07T20:28:12.3471670Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 127.9 MB/s eta 0:00:00
2025-05-07T20:28:12.3474293Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB)
2025-05-07T20:28:12.3474293Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 121.3 MB/s eta 0:00:00
2025-05-07T20:28:12.3475334Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:28:12.3476499Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.4 MB/s eta 0:00:00
2025-05-07T20:28:12.3477804Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:28:12.3479064Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 130.6 MB/s eta 0:00:00
2025-05-07T20:28:12.3480219Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB)
2025-05-07T20:28:12.3481483Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 153.7 MB/s eta 0:00:00
2025-05-07T20:28:12.3482569Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
2025-05-07T20:28:12.3484270Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB)
2025-05-07T20:28:12.3485588Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 130.8 MB/s eta 0:00:00
2025-05-07T20:28:12.3488170Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:28:12.3490597Z 
2025-05-07T20:28:12.3493649Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128
2025-05-07T20:28:12.3496755Z 
2025-05-07T20:28:14.5530761Z torch 2.8.0.dev20250507+cu128
2025-05-07T20:28:14.5533247Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128)
2025-05-07T20:28:18.0687582Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:21.6346036Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128
2025-05-07T20:28:21.6346626Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:25.0705033Z True
2025-05-07T20:28:25.0705286Z True
2025-05-07T20:28:25.0705392Z 
2025-05-07T20:28:25.1333039Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
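[NOTE] The variant, sub-package, and ABI checks above can be reproduced by hand with public torch APIs; a minimal sketch (the workflow's own checks live in setup_env.bash):

    # Re-run the post-install checks against the build_binary environment:
    # version suffix (+cu128 variant), CUDA availability on this GPU runner,
    # the _GLIBCXX_USE_CXX11_ABI probe, and torch.distributed importability.
    conda run -n build_binary python -c "import torch, torch.distributed; print(torch.__version__); print(torch.cuda.is_available()); print(torch.compiled_with_cxx11_abi())"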
2025-05-07T20:28:25.1380211Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:25.1380813Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:25.1393355Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:25.1393756Z env:
2025-05-07T20:28:25.1393989Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:25.1394295Z   BUILD_ENV: build_binary
2025-05-07T20:28:25.1394730Z   BUILD_TARGET: genai
2025-05-07T20:28:25.1394962Z   BUILD_VARIANT: cuda
2025-05-07T20:28:25.1395201Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:28:25.1395455Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:25.1395760Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:25.1396103Z ##[endgroup]
2025-05-07T20:28:25.4785460Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:25.4787480Z ################################################################################
2025-05-07T20:28:25.4787947Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:25.4788304Z #
2025-05-07T20:28:25.4803434Z # [2025-05-07T20:28:25.480Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:25.4803953Z ################################################################################
2025-05-07T20:28:25.4804174Z 
2025-05-07T20:28:25.4819272Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:25.5740553Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:25.5751113Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:25.5751750Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:25.5752163Z 
2025-05-07T20:28:25.6636319Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:25.6659588Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:31.6117354Z Collecting environment information...
2025-05-07T20:28:31.6117895Z PyTorch version: 2.8.0.dev20250507+cu128
2025-05-07T20:28:31.6118290Z Is debug build: False
2025-05-07T20:28:31.6118540Z CUDA used to build PyTorch: 12.8
2025-05-07T20:28:31.6118816Z ROCM used to build PyTorch: N/A
2025-05-07T20:28:31.6118990Z 
2025-05-07T20:28:31.6119101Z OS: Amazon Linux 2023.6.20250317 (x86_64)
2025-05-07T20:28:31.6119497Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:28:31.6119965Z Clang version: Could not collect
2025-05-07T20:28:31.6120362Z CMake version: Could not collect
2025-05-07T20:28:31.6120721Z Libc version: glibc-2.34
2025-05-07T20:28:31.6120941Z 
2025-05-07T20:28:31.6121296Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime)
2025-05-07T20:28:31.6121999Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34
2025-05-07T20:28:31.6122497Z Is CUDA available: True
2025-05-07T20:28:31.6122753Z CUDA runtime version: 12.8.61
2025-05-07T20:28:31.6123020Z CUDA_MODULE_LOADING set to: LAZY
2025-05-07T20:28:31.6123330Z GPU models and configuration: GPU 0: NVIDIA A10G
2025-05-07T20:28:31.6123653Z Nvidia driver version: 570.133.07
2025-05-07T20:28:31.6123929Z cuDNN version: Could not collect
2025-05-07T20:28:31.6124197Z HIP runtime version: N/A
2025-05-07T20:28:31.6124442Z MIOpen runtime version: N/A
2025-05-07T20:28:31.6124738Z Is XNNPACK available: True
2025-05-07T20:28:31.6124963Z 
2025-05-07T20:28:31.6125062Z CPU:
2025-05-07T20:28:31.6125281Z Architecture:            x86_64
2025-05-07T20:28:31.6125615Z CPU op-mode(s):          32-bit, 64-bit
2025-05-07T20:28:31.6126009Z Address sizes:           48 bits physical, 48 bits virtual
2025-05-07T20:28:31.6126398Z Byte Order:              Little Endian
2025-05-07T20:28:31.6126712Z CPU(s):                  16
2025-05-07T20:28:31.6127010Z On-line CPU(s) list:     0-15
2025-05-07T20:28:31.6127777Z Vendor ID:               AuthenticAMD
2025-05-07T20:28:31.6128124Z Model name:              AMD EPYC 7R32
2025-05-07T20:28:31.6128447Z CPU family:              23
2025-05-07T20:28:31.6128735Z Model:                   49
2025-05-07T20:28:31.6129023Z Thread(s) per core:      2
2025-05-07T20:28:31.6129310Z Core(s) per socket:      8
2025-05-07T20:28:31.6129594Z Socket(s):               1
2025-05-07T20:28:31.6130022Z Stepping:                0
2025-05-07T20:28:31.6130321Z BogoMIPS:                5600.00
2025-05-07T20:28:31.6132627Z Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:28:31.6134775Z Hypervisor vendor:       KVM
2025-05-07T20:28:31.6135090Z Virtualization type:     full
2025-05-07T20:28:31.6135430Z L1d cache:               256 KiB (8 instances)
2025-05-07T20:28:31.6135795Z L1i cache:               256 KiB (8 instances)
2025-05-07T20:28:31.6136154Z L2 cache:                4 MiB (8 instances)
2025-05-07T20:28:31.6136505Z L3 cache:                32 MiB (2 instances)
2025-05-07T20:28:31.6136829Z NUMA node(s):            1
2025-05-07T20:28:31.6137117Z NUMA node0 CPU(s):       0-15
2025-05-07T20:28:31.6137456Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:28:31.6137840Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:28:31.6138199Z Vulnerability L1tf:                   Not affected
Mds: Not affected 2025-05-07T20:28:31.6138912Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:31.6139266Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:31.6139673Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:31.6140231Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:31.6140824Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:31.6141374Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:31.6142064Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:31.6142939Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:31.6143630Z Vulnerability Srbds: Not affected 2025-05-07T20:28:31.6143992Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:31.6144229Z 2025-05-07T20:28:31.6144332Z Versions of relevant libraries: 2025-05-07T20:28:31.6144600Z [pip3] numpy==2.2.5 2025-05-07T20:28:31.6144846Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:31.6145149Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:31.6145468Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:31.6145791Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:31.6146101Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:31.6146395Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:31.6146692Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:31.6146988Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:31.6147299Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:31.6147735Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:31.6148033Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:31.6148323Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:31.6148630Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:31.6148921Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:31.6149221Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:31.6149600Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:31.6150177Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:31.6150689Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:31.6151221Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:31.6151765Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:31.6152306Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:31.6152792Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6153267Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:31.6153754Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:31.6154252Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:31.6154736Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6155283Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:31.6155786Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6156244Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6156725Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:31.6157215Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:31.6157680Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:31.6158153Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:31.6158625Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6159091Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:31.6159563Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6160036Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:31.6160518Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:31.6161006Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:31.6161551Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6162044Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:31.6162535Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:31.6163021Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:31.6163488Z [conda] numpy 2.2.5 py312h72c5963_0 conda-forge 2025-05-07T20:28:31.6163956Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:31.6164462Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:31.6164964Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:31.6165475Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:31.6165971Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:31.6166546Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:31.6167038Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:31.6167533Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:31.6168031Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:31.6168530Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:31.6169112Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:31.6169604Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:31.6170087Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:31.6170575Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:31.6171046Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:31.6171322Z 2025-05-07T20:28:31.6850998Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:31.6851687Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:31.6863737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:31.6864102Z env: 2025-05-07T20:28:31.6864327Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:31.6864628Z BUILD_ENV: build_binary 2025-05-07T20:28:31.6864900Z BUILD_TARGET: genai 2025-05-07T20:28:31.6865133Z BUILD_VARIANT: cuda 2025-05-07T20:28:31.6865365Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:31.6865625Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:31.6865938Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:31.6866278Z ##[endgroup] 2025-05-07T20:28:32.0240348Z ################################################################################ 2025-05-07T20:28:32.0240801Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:32.0241308Z # 2025-05-07T20:28:32.0257200Z # [2025-05-07T20:28:32.025Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:32.0257768Z ################################################################################ 2025-05-07T20:28:32.0258040Z 2025-05-07T20:28:32.0274378Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:32.1189000Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:32.1210856Z [BUILD] Running git submodules update ... 2025-05-07T20:28:32.1232752Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:32.1595076Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:32.1596059Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:32.1596964Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:32.1597765Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:32.1598587Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:32.1599474Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:32.1600305Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:32.1632882Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:32.2187884Z [BUILD] Installing other build dependencies ... 
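[NOTE] The [EXEC] [ATTEMPT 0/3] prefixes in this log come from a retry wrapper defined in setup_env.bash; its source is not shown here. A minimal sketch of the pattern, with an illustrative function name and retry/backoff values (both assumptions, not the script's literal code):

  # Retry a command up to the attempt limit, echoing each try like the log lines above.
  exec_with_retries () {
    local max=3
    for attempt in $(seq 0 "$max"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
      "$@" && return 0
      sleep 2  # brief pause before retrying
    done
    return 1
  }

  exec_with_retries git submodule update --init --recursive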
2025-05-07T20:28:32.2210426Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:34.6571140Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:34.6757212Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:34.7777325Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:34.7825578Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:35.0010504Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:35.0040120Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:35.1207411Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:35.1228777Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:35.4577700Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:35.4601948Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:35.5164932Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:35.5168641Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:35.5968991Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:35.5996411Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:35.6429762Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:35.7071010Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:35.7094962Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:35.8373648Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:35.8400028Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:35.9513799Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:35.9554600Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:36.0087451Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:36.0758558Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:36.0784588Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:36.1761313Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:36.1787516Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:36.3039237Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:36.3062696Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:36.4108499Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:36.4136588Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:36.5194512Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:36.5226710Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:36.6390337Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:36.6454189Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:36.7583701Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:36.7645162Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:36.8220777Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:36.8763714Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:36.8791341Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:36.9301160Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:36.9806470Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:36.9829482Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:37.0347434Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:37.0991443Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:37.1018118Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:37.1627664Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:37.2270334Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:37.2816363Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:37.8131886Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 52.4 MB/s eta 0:00:00 2025-05-07T20:28:37.8159527Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:37.8708229Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:37.9334823Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:37.9908929Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:38.0570349Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:38.1145430Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:28:38.1827508Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 7.7 MB/s eta 0:00:00 2025-05-07T20:28:38.1893344Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:38.2395234Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:38.2919076Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:38.3405514Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:38.3950502Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:38.4503835Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:38.5021232Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:38.5555920Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:38.6003550Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:38.6496577Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:38.8152659Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:41.1710007Z 2025-05-07T20:28:41.1755383Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:41.3482656Z ################################################################################ 2025-05-07T20:28:41.3483005Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:41.3483266Z # 2025-05-07T20:28:41.3500203Z # [2025-05-07T20:28:41.349Z] + install_triton_pip build_binary 2025-05-07T20:28:41.3500599Z ################################################################################ 2025-05-07T20:28:41.3500816Z 2025-05-07T20:28:41.3501052Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:41.3501490Z ################################################################################ 2025-05-07T20:28:41.3501854Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:41.3502180Z # 2025-05-07T20:28:41.3517557Z # [2025-05-07T20:28:41.351Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:41.3518093Z ################################################################################ 2025-05-07T20:28:41.3518315Z 2025-05-07T20:28:41.3533029Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:41.4403949Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:41.4404405Z ################################################################################ 2025-05-07T20:28:41.4405008Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:41.4405301Z # 2025-05-07T20:28:41.4421652Z # [2025-05-07T20:28:41.441Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:41.4422146Z ################################################################################ 2025-05-07T20:28:41.4422377Z 2025-05-07T20:28:41.4470747Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:41.4487424Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:41.4488168Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:41.4497258Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:41.4507556Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:41.4529430Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:49.3020067Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:49.3021324Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:49.3022021Z 2025-05-07T20:28:49.3022233Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:49.3022659Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:49.3023481Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:49.3024740Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:49.3025856Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 53.2 MB/s eta 0:00:00 2025-05-07T20:28:49.3026280Z Installing collected packages: pytorch-triton 2025-05-07T20:28:49.3026656Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:49.3027060Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:49.3027496Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:49.3027930Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:49.3028384Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:49.3028657Z 2025-05-07T20:28:51.5054245Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:51.5058047Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:53.6583983Z ################################################################################ 2025-05-07T20:28:53.6584442Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:53.6584869Z ################################################################################ 2025-05-07T20:28:53.6585085Z 2025-05-07T20:28:55.6993483Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:57.8735904Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:57.8740092Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:57.8772454Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:57.8772961Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:57.8784594Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:57.8784944Z env: 2025-05-07T20:28:57.8785170Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:57.8785472Z BUILD_ENV: build_binary 2025-05-07T20:28:57.8785724Z BUILD_TARGET: genai 2025-05-07T20:28:57.8785958Z BUILD_VARIANT: cuda 2025-05-07T20:28:57.8786196Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:57.8786673Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:57.8786985Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:57.8787333Z ##[endgroup] 2025-05-07T20:28:58.2159091Z ################################################################################ 2025-05-07T20:28:58.2159526Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:58.2159794Z # 2025-05-07T20:28:58.2174341Z # [2025-05-07T20:28:58.217Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2175025Z ################################################################################ 2025-05-07T20:28:58.2175251Z 2025-05-07T20:28:58.2175625Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2176343Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2176694Z 2025-05-07T20:28:58.2336664Z f50ab0f907b8f67d4668daa75040e0b225eb54da fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2338812Z 2025-05-07T20:28:58.2339708Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2525415Z 2025-05-07T20:28:58.2526033Z 84bef3f6640ba9766f361e68bdc3f73d7442e779219a4c2a79fb3e077b76dfbc fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2528414Z 2025-05-07T20:28:58.2528791Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2529139Z 2025-05-07T20:28:58.2859215Z 9355e644e981da7f530670dbccbd5e53 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:28:58.2861538Z 2025-05-07T20:28:58.2870713Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:58.2891893Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:01.0879849Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:01.0880865Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:01.0881760Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:01.0882206Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:01.0882480Z 2025-05-07T20:29:08.0845123Z ################################################################################ 2025-05-07T20:29:08.0846071Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:08.0846768Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128
2025-05-07T20:29:08.0847549Z [CHECK] CUDA version reported by PyTorch is: 12.8
2025-05-07T20:29:08.0848125Z [CHECK]
2025-05-07T20:29:08.0848704Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:29:08.0849667Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:29:08.0850382Z ################################################################################
2025-05-07T20:29:08.0850769Z
2025-05-07T20:29:08.0850980Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:29:12.1107005Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:29:16.1376674Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:20.1501262Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:20.1504678Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:32.1537439Z ################################################################################
2025-05-07T20:29:32.1537879Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:32.1538225Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:32.1538577Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:32.1539306Z ################################################################################
2025-05-07T20:29:32.1539533Z
2025-05-07T20:29:40.1384034Z ################################################################################
2025-05-07T20:29:40.1384599Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:40.1386105Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:40.1387746Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:40.1388285Z ################################################################################
2025-05-07T20:29:40.1388524Z
2025-05-07T20:29:40.1388684Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:44.1598547Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:48.1504303Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:52.2467847Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:56.2494153Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:56.2499316Z [INSTALL] Check for operator registrations ...
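[NOTE] The operator-registration check that follows imports fbgemm_gpu (which loads the compiled operator libraries) and then resolves each name on torch.ops. A hand-rolled equivalent of the same check (assembled for illustration; not the script's literal code):

  conda run -n build_binary python -c "
  import torch
  import fbgemm_gpu  # the import loads the FBGEMM operator libraries
  # attribute lookup succeeds only if the operator was registered with PyTorch
  print(torch.ops.fbgemm.nccl_init)
  "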
2025-05-07T20:30:00.1497425Z fbgemm.nccl_init 2025-05-07T20:30:00.1497644Z 2025-05-07T20:30:00.2116337Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:04.1174090Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:04.1174317Z 2025-05-07T20:30:04.1802120Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:08.0741179Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:08.0741438Z 2025-05-07T20:30:08.1367185Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:08.1367798Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:08.1404092Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:08.1404574Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:08.1418125Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:08.1418491Z env: 2025-05-07T20:30:08.1428351Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:08.1428679Z BUILD_ENV: build_binary 2025-05-07T20:30:08.1428927Z BUILD_TARGET: genai 2025-05-07T20:30:08.1429160Z BUILD_VARIANT: cuda 2025-05-07T20:30:08.1429424Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:08.1429711Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:08.1430014Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:08.1430371Z ##[endgroup] 2025-05-07T20:30:08.4796341Z ################################################################################ 2025-05-07T20:30:08.4796688Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:08.4796945Z # 2025-05-07T20:30:08.4814086Z # [2025-05-07T20:30:08.480Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:08.4814506Z ################################################################################ 2025-05-07T20:30:08.4814722Z 2025-05-07T20:30:16.4134577Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:16.4135186Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:16.4135590Z [TEST] Determined the test directories: 2025-05-07T20:30:16.4135920Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:16.4136236Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:16.4136552Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:16.4136747Z 2025-05-07T20:30:16.4143212Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:16.4150213Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:16.4150666Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:16.4150952Z 2025-05-07T20:30:16.8380549Z 2025-05-07T20:30:16.8380882Z [TEST] Installing PyTest ... 
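[NOTE] The (genai : cuda) pair above is read back from the installed package itself: fbgemm_gpu exposes __target__ and __variant__, both of which were found during the symbol checks earlier. A quick manual confirmation (a sketch, not taken from this log):

  conda run -n build_binary python -c "import fbgemm_gpu; print(fbgemm_gpu.__target__, fbgemm_gpu.__variant__, fbgemm_gpu.__version__)"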
2025-05-07T20:30:16.8404797Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:17.9327381Z Channels:
2025-05-07T20:30:17.9327691Z  - conda-forge
2025-05-07T20:30:17.9327986Z Platform: linux-64
2025-05-07T20:30:21.1923987Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:22.3319514Z Solving environment: done
2025-05-07T20:30:22.5602080Z ## Package Plan ##
2025-05-07T20:30:22.5602884Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:22.5603714Z   added / updated specs:
2025-05-07T20:30:22.5604003Z     - expecttest
2025-05-07T20:30:22.5604257Z     - pytest
2025-05-07T20:30:22.5604528Z The following packages will be downloaded:
2025-05-07T20:30:22.5604885Z     package                    |            build
2025-05-07T20:30:22.5605225Z     ---------------------------|-----------------
2025-05-07T20:30:22.5605624Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:22.5606103Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:22.5606818Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:22.5607278Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:22.5607735Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:22.5608172Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:22.5608616Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:22.5609332Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:22.5609738Z     ------------------------------------------------------------
2025-05-07T20:30:22.5610084Z                                            Total:         428 KB
2025-05-07T20:30:22.5610434Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:22.5610869Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:22.5611383Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:22.5611919Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:22.5612470Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:22.5612951Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:22.5613403Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:22.5613845Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:22.5614280Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:22.5614701Z Downloading and Extracting Packages: done (per-package progress bars elided; all eight packages downloaded to 100%)
2025-05-07T20:30:23.1520660Z Preparing transaction: done
2025-05-07T20:30:23.2525564Z Verifying transaction: done
2025-05-07T20:30:25.1556676Z Executing transaction: done
2025-05-07T20:30:25.2810830Z [TEST] Checking imports ...
2025-05-07T20:30:29.2620881Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:29.2633196Z [TEST] Setting feature flags ...
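[NOTE] conda env config vars stores variables in the environment itself, so the feature flag set below takes effect for every subsequent conda run/activation rather than just this shell. To verify what is persisted (standard conda CLI usage, shown as an assumption rather than copied from this run):

  conda env config vars list -n build_binary
  # expected to include FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 once the step below completes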
2025-05-07T20:30:29.2633777Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:29.2634242Z 2025-05-07T20:30:29.6836183Z 2025-05-07T20:30:29.6837140Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:29.6839541Z ################################################################################ 2025-05-07T20:30:29.6840442Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:29.6841090Z # 2025-05-07T20:30:29.6860767Z # [2025-05-07T20:30:29.685Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:29.6861371Z ################################################################################ 2025-05-07T20:30:29.6861677Z 2025-05-07T20:30:29.6868274Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:29.6898249Z ./attention/gqa_test.py 2025-05-07T20:30:29.6898620Z ./coalesce/coalesce_test.py 2025-05-07T20:30:29.6898999Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:29.6899381Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:29.6899787Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:29.6900051Z ./moe/activation_test.py 2025-05-07T20:30:29.6900308Z ./moe/gather_scatter_test.py 2025-05-07T20:30:29.6900559Z ./moe/layers_test.py 2025-05-07T20:30:29.6900796Z ./moe/shuffling_test.py 2025-05-07T20:30:29.6901056Z ./quantize/quantize_test.py 2025-05-07T20:30:29.6901221Z 2025-05-07T20:30:29.6901338Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:29.6901556Z 2025-05-07T20:30:29.6920190Z ################################################################################ 2025-05-07T20:30:29.6935807Z # [2025-05-07T20:30:29.693Z] Run Python Test Suite: 2025-05-07T20:30:29.6936281Z # ./attention/gqa_test.py 2025-05-07T20:30:29.6936662Z ################################################################################ 2025-05-07T20:30:29.6960681Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:29.6961475Z 2025-05-07T20:30:32.2255868Z ============================= test session starts ============================== 2025-05-07T20:30:32.2256662Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:32.2257284Z cachedir: .pytest_cache 2025-05-07T20:30:32.2257881Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:32.2258957Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:32.2259384Z plugins: hypothesis-6.131.14 2025-05-07T20:30:33.9172243Z collecting ... 
collected 2 items
2025-05-07T20:31:11.8449991Z attention/gqa_test.py::Int4GQATest::test_gqa Trying 40 Hypothesis examples (the self=<Int4GQATest ...> repr in each call is omitted below):
  test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
  test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
  test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
  test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
  test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
  test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
  test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
  test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
  test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
  test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
  test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
  test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
  test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
  test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
  test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
  test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
  test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
  test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
  test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
  test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
  test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
  test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
  test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
  test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
  test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
  test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
  test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
  test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
  test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
  test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
  test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:31:11.8540985Z PASSED
2025-05-07T20:31:11.8645955Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
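[NOTE] The derandomized example sequence above is governed by the Hypothesis 'ci' profile reported in the session header (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). One way to inspect that profile outside the run (a sketch; not part of this log):

  conda run -n build_binary python -c "from hypothesis import settings; print(settings.get_profile('ci'))"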
2025-05-07T20:31:11.8646295Z 2025-05-07T20:31:11.8646450Z =========================== short test summary info ============================ 2025-05-07T20:31:11.8647190Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:11.8647917Z ======================== 1 passed, 1 skipped in 40.14s ========================= 2025-05-07T20:31:12.5174084Z 2025-05-07T20:31:12.5175049Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:12.5194203Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds 2025-05-07T20:31:12.5194493Z 2025-05-07T20:31:12.5194497Z 2025-05-07T20:31:12.5194501Z 2025-05-07T20:31:12.5194505Z 2025-05-07T20:31:12.5215172Z ################################################################################ 2025-05-07T20:31:12.5230679Z # [2025-05-07T20:31:12.522Z] Run Python Test Suite: 2025-05-07T20:31:12.5231029Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:12.5231340Z ################################################################################ 2025-05-07T20:31:12.5257470Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:12.5258137Z 2025-05-07T20:31:14.6665674Z ============================= test session starts ============================== 2025-05-07T20:31:14.6666393Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:14.6666926Z cachedir: .pytest_cache 2025-05-07T20:31:14.6667529Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:14.6668288Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:14.6668709Z plugins: hypothesis-6.131.14 2025-05-07T20:31:16.4009791Z collecting ... 
collected 1 item 2025-05-07T20:31:16.4010007Z 2025-05-07T20:31:17.1565071Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:17.1565425Z 2025-05-07T20:31:17.1565570Z ============================== 1 passed in 2.61s =============================== 2025-05-07T20:31:17.7870992Z 2025-05-07T20:31:17.7871683Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:17.7892322Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:17.7892648Z 2025-05-07T20:31:17.7892654Z 2025-05-07T20:31:17.7892658Z 2025-05-07T20:31:17.7892662Z 2025-05-07T20:31:17.7913469Z ################################################################################ 2025-05-07T20:31:17.7928387Z # [2025-05-07T20:31:17.792Z] Run Python Test Suite: 2025-05-07T20:31:17.7928747Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:17.7929039Z ################################################################################ 2025-05-07T20:31:17.7954173Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:17.7954826Z 2025-05-07T20:31:19.9449590Z ============================= test session starts ============================== 2025-05-07T20:31:19.9450266Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:19.9450810Z cachedir: .pytest_cache 2025-05-07T20:31:19.9451422Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:19.9452471Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:19.9453087Z plugins: hypothesis-6.131.14 2025-05-07T20:31:21.6590895Z collecting ... 
collected 5 items 2025-05-07T20:31:21.6591210Z 2025-05-07T20:31:21.6603913Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:21.6614470Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:21.6622521Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:21.6630512Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:21.6649538Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:21.6649904Z 2025-05-07T20:31:21.6650453Z =========================== short test summary info ============================ 2025-05-07T20:31:21.6651158Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6652314Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6653704Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6654946Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6655905Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:21.6656571Z ============================== 5 skipped in 1.84s ============================== 2025-05-07T20:31:22.2418226Z 2025-05-07T20:31:22.2419004Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:22.2437699Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:22.2438010Z 2025-05-07T20:31:22.2438014Z 2025-05-07T20:31:22.2438017Z 2025-05-07T20:31:22.2438021Z 2025-05-07T20:31:22.2459159Z ################################################################################ 2025-05-07T20:31:22.2476866Z # [2025-05-07T20:31:22.247Z] Run Python Test Suite: 2025-05-07T20:31:22.2477227Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:22.2477546Z ################################################################################ 2025-05-07T20:31:22.2502358Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:22.2503702Z 2025-05-07T20:31:24.4121596Z ============================= test session starts ============================== 2025-05-07T20:31:24.4122328Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:24.4122872Z cachedir: .pytest_cache 2025-05-07T20:31:24.4123477Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:24.4124240Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:24.4124661Z plugins: hypothesis-6.131.14 2025-05-07T20:31:26.2247005Z collecting ... 
collected 2 items 2025-05-07T20:31:26.2247219Z 2025-05-07T20:31:26.2258708Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:26.2275681Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:26.2276125Z 2025-05-07T20:31:26.2276289Z =========================== short test summary info ============================ 2025-05-07T20:31:26.2276933Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:26.2277783Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:26.2278398Z ============================== 2 skipped in 1.94s ============================== 2025-05-07T20:31:26.8165989Z 2025-05-07T20:31:26.8166479Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:26.8187534Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:26.8187889Z 2025-05-07T20:31:26.8187894Z 2025-05-07T20:31:26.8187925Z 2025-05-07T20:31:26.8187929Z 2025-05-07T20:31:26.8208772Z ################################################################################ 2025-05-07T20:31:26.8227293Z # [2025-05-07T20:31:26.822Z] Run Python Test Suite: 2025-05-07T20:31:26.8227978Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:26.8228275Z ################################################################################ 2025-05-07T20:31:26.8252354Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:26.8253298Z 2025-05-07T20:31:28.9709482Z ============================= test session starts ============================== 2025-05-07T20:31:28.9710297Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:28.9710834Z cachedir: .pytest_cache 2025-05-07T20:31:28.9711430Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:28.9712198Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:28.9712612Z plugins: hypothesis-6.131.14 2025-05-07T20:31:30.6584497Z collecting ... collected 4 items 2025-05-07T20:31:30.6584865Z 2025-05-07T20:31:33.4184163Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
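The skips above all follow one capability-gating pattern: the multi-GPU CAR tests require at least two devices, and the gather/scatter tests require a Hopper-class GPU, which the A10G on this linux.g5.4xlarge runner (SM 8.6) is not. A hedged sketch of such guards; the predicate names are illustrative, not FBGEMM's actual helpers.

# Sketch of the capability-based skip guards suggested by the log messages.
# Predicates are assumptions; FBGEMM's actual decorators may differ.
import unittest

import torch

def has_hopper_gpu() -> bool:
    # Hopper is SM 9.0; the A10G on linux.g5.4xlarge reports SM 8.6.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

def gpu_count() -> int:
    return torch.cuda.device_count() if torch.cuda.is_available() else 0

class CapabilityGatedTests(unittest.TestCase):
    @unittest.skipIf(not has_hopper_gpu(),
                     "Skip when no Hopper GPU is available. This test is only for Hopper GPU.")
    def test_gather_along_first_dim(self) -> None:
        ...  # body elided in this sketch

    @unittest.skipIf(gpu_count() < 2,
                     "these tests require at least two GPUs")
    def test_allreduce(self) -> None:
        ...  # body elided in this sketch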
2025-05-07T20:31:33.4268962Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:33.4364893Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:33.4454329Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:33.4454686Z 2025-05-07T20:31:33.4454845Z =========================== short test summary info ============================ 2025-05-07T20:31:33.4455563Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:33.4456776Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:31:33.4457407Z ============================== 4 skipped in 4.60s ============================== 2025-05-07T20:31:35.3945071Z 2025-05-07T20:31:35.3945862Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:35.3966029Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:31:35.3966338Z 2025-05-07T20:31:35.3966342Z 2025-05-07T20:31:35.3966346Z 2025-05-07T20:31:35.3966351Z 2025-05-07T20:31:35.3986919Z ################################################################################ 2025-05-07T20:31:35.4001786Z # [2025-05-07T20:31:35.399Z] Run Python Test Suite: 2025-05-07T20:31:35.4002127Z # ./moe/activation_test.py 2025-05-07T20:31:35.4002405Z ################################################################################ 2025-05-07T20:31:35.4028817Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:35.4029686Z 2025-05-07T20:31:37.5505793Z ============================= test session starts ============================== 2025-05-07T20:31:37.5506784Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:37.5507328Z cachedir: .pytest_cache 2025-05-07T20:31:37.5507926Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:37.5508679Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:37.5509090Z plugins: hypothesis-6.131.14 2025-05-07T20:31:39.2004460Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:39.3077034Z collecting ... 
collected 2 items 2025-05-07T20:31:39.3077250Z 2025-05-07T20:31:44.6388003Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:44.6388844Z self=, 2025-05-07T20:31:44.6389675Z T=1, 2025-05-07T20:31:44.6389913Z D=5120, 2025-05-07T20:31:44.6390178Z contiguous=True, 2025-05-07T20:31:44.6390488Z compiled=True, 2025-05-07T20:31:44.6390795Z ) 2025-05-07T20:31:44.6391070Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6391596Z self=, 2025-05-07T20:31:44.6392005Z T=4096, 2025-05-07T20:31:44.6392197Z D=5120, 2025-05-07T20:31:44.6392438Z contiguous=True, 2025-05-07T20:31:44.6392756Z compiled=True, 2025-05-07T20:31:44.6393020Z ) 2025-05-07T20:31:44.6393254Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6393629Z self=, 2025-05-07T20:31:44.6394018Z T=4096, 2025-05-07T20:31:44.6394226Z D=7168, 2025-05-07T20:31:44.6394428Z contiguous=False, 2025-05-07T20:31:44.6394654Z compiled=False, 2025-05-07T20:31:44.6394865Z ) 2025-05-07T20:31:44.6395065Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6395496Z self=, 2025-05-07T20:31:44.6395885Z T=4096, 2025-05-07T20:31:44.6396079Z D=5120, 2025-05-07T20:31:44.6396271Z contiguous=False, 2025-05-07T20:31:44.6396501Z compiled=True, 2025-05-07T20:31:44.6396704Z ) 2025-05-07T20:31:44.6396897Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6397277Z self=, 2025-05-07T20:31:44.6397665Z T=1, 2025-05-07T20:31:44.6397843Z D=7168, 2025-05-07T20:31:44.6398040Z contiguous=True, 2025-05-07T20:31:44.6398274Z compiled=True, 2025-05-07T20:31:44.6398477Z ) 2025-05-07T20:31:44.6398676Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6399059Z self=, 2025-05-07T20:31:44.6399657Z T=1, 2025-05-07T20:31:44.6399845Z D=7168, 2025-05-07T20:31:44.6400046Z contiguous=False, 2025-05-07T20:31:44.6400276Z compiled=True, 2025-05-07T20:31:44.6400483Z ) 2025-05-07T20:31:44.6400681Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6401062Z self=, 2025-05-07T20:31:44.6401443Z T=4096, 2025-05-07T20:31:44.6401634Z D=5120, 2025-05-07T20:31:44.6401836Z contiguous=False, 2025-05-07T20:31:44.6402062Z compiled=False, 2025-05-07T20:31:44.6402273Z ) 2025-05-07T20:31:44.6402474Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6402848Z self=, 2025-05-07T20:31:44.6403236Z T=1, 2025-05-07T20:31:44.6403423Z D=7168, 2025-05-07T20:31:44.6403617Z contiguous=True, 2025-05-07T20:31:44.6403845Z compiled=False, 2025-05-07T20:31:44.6404054Z ) 2025-05-07T20:31:44.6404252Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6404636Z self=, 2025-05-07T20:31:44.6405021Z T=2048, 2025-05-07T20:31:44.6405202Z D=5120, 2025-05-07T20:31:44.6405401Z contiguous=True, 2025-05-07T20:31:44.6405628Z compiled=True, 2025-05-07T20:31:44.6405829Z ) 2025-05-07T20:31:44.6406027Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6406780Z self=, 2025-05-07T20:31:44.6407168Z T=2048, 2025-05-07T20:31:44.6407350Z D=7168, 2025-05-07T20:31:44.6407545Z contiguous=True, 2025-05-07T20:31:44.6407769Z compiled=True, 2025-05-07T20:31:44.6407968Z ) 2025-05-07T20:31:44.6408167Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6408546Z self=, 2025-05-07T20:31:44.6408926Z T=2048, 2025-05-07T20:31:44.6409111Z D=7168, 2025-05-07T20:31:44.6409307Z contiguous=True, 2025-05-07T20:31:44.6409535Z compiled=False, 2025-05-07T20:31:44.6409743Z ) 2025-05-07T20:31:44.6409940Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6410311Z self=, 2025-05-07T20:31:44.6410869Z T=128, 2025-05-07T20:31:44.6411069Z D=5120, 2025-05-07T20:31:44.6411268Z contiguous=False, 2025-05-07T20:31:44.6411494Z 
compiled=True, 2025-05-07T20:31:44.6411697Z ) 2025-05-07T20:31:44.6412001Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6419703Z self=, 2025-05-07T20:31:44.6420106Z T=128, 2025-05-07T20:31:44.6420287Z D=5120, 2025-05-07T20:31:44.6420482Z contiguous=True, 2025-05-07T20:31:44.6420708Z compiled=True, 2025-05-07T20:31:44.6420917Z ) 2025-05-07T20:31:44.6421113Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6421503Z self=, 2025-05-07T20:31:44.6421898Z T=16384, 2025-05-07T20:31:44.6422105Z D=5120, 2025-05-07T20:31:44.6422304Z contiguous=False, 2025-05-07T20:31:44.6422536Z compiled=True, 2025-05-07T20:31:44.6422735Z ) 2025-05-07T20:31:44.6422935Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6423322Z self=, 2025-05-07T20:31:44.6423704Z T=16384, 2025-05-07T20:31:44.6423901Z D=5120, 2025-05-07T20:31:44.6424106Z contiguous=False, 2025-05-07T20:31:44.6424332Z compiled=False, 2025-05-07T20:31:44.6424539Z ) 2025-05-07T20:31:44.6424736Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6425112Z self=, 2025-05-07T20:31:44.6425491Z T=128, 2025-05-07T20:31:44.6425679Z D=7168, 2025-05-07T20:31:44.6425869Z contiguous=True, 2025-05-07T20:31:44.6426095Z compiled=False, 2025-05-07T20:31:44.6426300Z ) 2025-05-07T20:31:44.6426489Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6426868Z self=, 2025-05-07T20:31:44.6427423Z T=128, 2025-05-07T20:31:44.6427609Z D=7168, 2025-05-07T20:31:44.6427831Z contiguous=False, 2025-05-07T20:31:44.6428083Z compiled=False, 2025-05-07T20:31:44.6428289Z ) 2025-05-07T20:31:44.6428485Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6428863Z self=, 2025-05-07T20:31:44.6429246Z T=1, 2025-05-07T20:31:44.6429429Z D=5120, 2025-05-07T20:31:44.6429627Z contiguous=False, 2025-05-07T20:31:44.6429848Z compiled=False, 2025-05-07T20:31:44.6430057Z ) 2025-05-07T20:31:44.6430256Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6430629Z self=, 2025-05-07T20:31:44.6431012Z T=1, 2025-05-07T20:31:44.6431196Z D=7168, 2025-05-07T20:31:44.6431386Z contiguous=False, 2025-05-07T20:31:44.6431613Z compiled=False, 2025-05-07T20:31:44.6431821Z ) 2025-05-07T20:31:44.6432022Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6432392Z self=, 2025-05-07T20:31:44.6432777Z T=4096, 2025-05-07T20:31:44.6432966Z D=5120, 2025-05-07T20:31:44.6433162Z contiguous=True, 2025-05-07T20:31:44.6433388Z compiled=False, 2025-05-07T20:31:44.6433591Z ) 2025-05-07T20:31:44.6433782Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6434157Z self=, 2025-05-07T20:31:44.6434541Z T=128, 2025-05-07T20:31:44.6434724Z D=7168, 2025-05-07T20:31:44.6434924Z contiguous=True, 2025-05-07T20:31:44.6435150Z compiled=True, 2025-05-07T20:31:44.6435348Z ) 2025-05-07T20:31:44.6435543Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6435918Z self=, 2025-05-07T20:31:44.6436296Z T=1, 2025-05-07T20:31:44.6436480Z D=5120, 2025-05-07T20:31:44.6436677Z contiguous=False, 2025-05-07T20:31:44.6436903Z compiled=True, 2025-05-07T20:31:44.6437111Z ) 2025-05-07T20:31:44.6437307Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6437681Z self=, 2025-05-07T20:31:44.6438165Z T=4096, 2025-05-07T20:31:44.6438360Z D=7168, 2025-05-07T20:31:44.6438557Z contiguous=True, 2025-05-07T20:31:44.6438778Z compiled=False, 2025-05-07T20:31:44.6438985Z ) 2025-05-07T20:31:44.6439183Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6439553Z self=, 2025-05-07T20:31:44.6439938Z T=4096, 2025-05-07T20:31:44.6440128Z D=7168, 2025-05-07T20:31:44.6440312Z contiguous=False, 2025-05-07T20:31:44.6440538Z compiled=True, 2025-05-07T20:31:44.6440747Z ) 
2025-05-07T20:31:44.6440939Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6441314Z self=, 2025-05-07T20:31:44.6441707Z T=128, 2025-05-07T20:31:44.6441888Z D=5120, 2025-05-07T20:31:44.6442082Z contiguous=True, 2025-05-07T20:31:44.6442305Z compiled=False, 2025-05-07T20:31:44.6442504Z ) 2025-05-07T20:31:44.6442697Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6443074Z self=, 2025-05-07T20:31:44.6443449Z T=128, 2025-05-07T20:31:44.6443635Z D=5120, 2025-05-07T20:31:44.6443834Z contiguous=False, 2025-05-07T20:31:44.6444061Z compiled=False, 2025-05-07T20:31:44.6444257Z ) 2025-05-07T20:31:44.6444450Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6444822Z self=, 2025-05-07T20:31:44.6445195Z T=1, 2025-05-07T20:31:44.6445376Z D=5120, 2025-05-07T20:31:44.6445569Z contiguous=True, 2025-05-07T20:31:44.6445785Z compiled=False, 2025-05-07T20:31:44.6445986Z ) 2025-05-07T20:31:44.6446181Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6446642Z self=, 2025-05-07T20:31:44.6447024Z T=2048, 2025-05-07T20:31:44.6447212Z D=7168, 2025-05-07T20:31:44.6447399Z contiguous=False, 2025-05-07T20:31:44.6447621Z compiled=True, 2025-05-07T20:31:44.6447830Z ) 2025-05-07T20:31:44.6448017Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6448389Z self=, 2025-05-07T20:31:44.6448772Z T=2048, 2025-05-07T20:31:44.6448951Z D=7168, 2025-05-07T20:31:44.6449146Z contiguous=False, 2025-05-07T20:31:44.6449371Z compiled=False, 2025-05-07T20:31:44.6449567Z ) 2025-05-07T20:31:44.6449761Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6450135Z self=, 2025-05-07T20:31:44.6450516Z T=16384, 2025-05-07T20:31:44.6450705Z D=7168, 2025-05-07T20:31:44.6450904Z contiguous=False, 2025-05-07T20:31:44.6451130Z compiled=True, 2025-05-07T20:31:44.6451332Z ) 2025-05-07T20:31:44.6451531Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6451977Z self=, 2025-05-07T20:31:44.6452353Z T=16384, 2025-05-07T20:31:44.6452547Z D=7168, 2025-05-07T20:31:44.6452746Z contiguous=True, 2025-05-07T20:31:44.6452962Z compiled=True, 2025-05-07T20:31:44.6453166Z ) 2025-05-07T20:31:44.6453364Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6453731Z self=, 2025-05-07T20:31:44.6454113Z T=4096, 2025-05-07T20:31:44.6454301Z D=7168, 2025-05-07T20:31:44.6454487Z contiguous=True, 2025-05-07T20:31:44.6454706Z compiled=True, 2025-05-07T20:31:44.6454909Z ) 2025-05-07T20:31:44.6455098Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6455469Z self=, 2025-05-07T20:31:44.6455853Z T=2048, 2025-05-07T20:31:44.6456039Z D=5120, 2025-05-07T20:31:44.6456232Z contiguous=False, 2025-05-07T20:31:44.6456455Z compiled=False, 2025-05-07T20:31:44.6456659Z ) 2025-05-07T20:31:44.6456848Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6457322Z self=, 2025-05-07T20:31:44.6457723Z T=2048, 2025-05-07T20:31:44.6457931Z D=5120, 2025-05-07T20:31:44.6458125Z contiguous=True, 2025-05-07T20:31:44.6458348Z compiled=False, 2025-05-07T20:31:44.6458548Z ) 2025-05-07T20:31:44.6458746Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6459119Z self=, 2025-05-07T20:31:44.6459493Z T=128, 2025-05-07T20:31:44.6459682Z D=7168, 2025-05-07T20:31:44.6459878Z contiguous=False, 2025-05-07T20:31:44.6460095Z compiled=True, 2025-05-07T20:31:44.6460296Z ) 2025-05-07T20:31:44.6460491Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6460860Z self=, 2025-05-07T20:31:44.6461246Z T=16384, 2025-05-07T20:31:44.6461441Z D=5120, 2025-05-07T20:31:44.6461627Z contiguous=True, 2025-05-07T20:31:44.6461848Z compiled=True, 2025-05-07T20:31:44.6462052Z ) 2025-05-07T20:31:44.6462249Z Trying example: 
test_silu_mul( 2025-05-07T20:31:44.6462625Z self=, 2025-05-07T20:31:44.6463009Z T=2048, 2025-05-07T20:31:44.6463198Z D=5120, 2025-05-07T20:31:44.6463386Z contiguous=False, 2025-05-07T20:31:44.6463610Z compiled=True, 2025-05-07T20:31:44.6463812Z ) 2025-05-07T20:31:44.6464001Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6464376Z self=, 2025-05-07T20:31:44.6464756Z T=16384, 2025-05-07T20:31:44.6464942Z D=5120, 2025-05-07T20:31:44.6465141Z contiguous=True, 2025-05-07T20:31:44.6465367Z compiled=False, 2025-05-07T20:31:44.6465567Z ) 2025-05-07T20:31:44.6465764Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6466256Z self=, 2025-05-07T20:31:44.6466631Z T=16384, 2025-05-07T20:31:44.6466826Z D=7168, 2025-05-07T20:31:44.6467021Z contiguous=False, 2025-05-07T20:31:44.6467244Z compiled=False, 2025-05-07T20:31:44.6467446Z ) 2025-05-07T20:31:44.6467641Z Trying example: test_silu_mul( 2025-05-07T20:31:44.6468026Z self=, 2025-05-07T20:31:44.6468400Z T=16384, 2025-05-07T20:31:44.6468597Z D=7168, 2025-05-07T20:31:44.6468793Z contiguous=True, 2025-05-07T20:31:44.6469012Z compiled=False, 2025-05-07T20:31:44.6469217Z ) 2025-05-07T20:31:44.6469395Z PASSED 2025-05-07T20:31:44.7053068Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:44.7054191Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:44.7055625Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:44.7057132Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:44.7058138Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:44.7059493Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:44.7060946Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.7062331Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:44.7063623Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:44.7065063Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.7066176Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:31:44.7067523Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:44.7068823Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:44.7070100Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:44.7071365Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:44.7072233Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:44.7073456Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:44.7074519Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:44.7075351Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:44.7076617Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:44.7077959Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:44.7079135Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:44.7080224Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:44.7081459Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:44.7082880Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:44.7083991Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.7084941Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.7085834Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:44.7086898Z W0507 20:31:44.703000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the same identify_mutated_tensors warning and CompilationError traceback repeated three more times, at W0507 20:31:44.720000, 20:31:44.759000, and 20:31:44.763000, each ending with:]
ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1797394Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1798383Z self=, 2025-05-07T20:31:45.1798894Z T=1, 2025-05-07T20:31:45.1799077Z D=5120, 2025-05-07T20:31:45.1799265Z scale_ub=None, 2025-05-07T20:31:45.1799480Z contiguous=True, 2025-05-07T20:31:45.1799711Z compiled=True, 2025-05-07T20:31:45.1799916Z ) 2025-05-07T20:31:45.1800250Z self = 2025-05-07T20:31:45.1800752Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.1801021Z 2025-05-07T20:31:45.1801110Z @given( 2025-05-07T20:31:45.1801341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.1801668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.1801997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.1802328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.1802673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.1802967Z ) 2025-05-07T20:31:45.1803321Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.1803782Z def test_silu_mul_quant( 2025-05-07T20:31:45.1804032Z self, 2025-05-07T20:31:45.1804232Z T: int, 2025-05-07T20:31:45.1804426Z D: int, 2025-05-07T20:31:45.1804648Z scale_ub: Optional[float], 2025-05-07T20:31:45.1804929Z contiguous: bool, 2025-05-07T20:31:45.1805167Z compiled: bool, 2025-05-07T20:31:45.1805409Z ) -> None: 2025-05-07T20:31:45.1805632Z torch.manual_seed(2025) 2025-05-07T20:31:45.1805874Z 2025-05-07T20:31:45.1806409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.1806807Z 2025-05-07T20:31:45.1807006Z x_sign = torch.sign(x) 2025-05-07T20:31:45.1807327Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.1807674Z x = x_sign * x_clamp 2025-05-07T20:31:45.1808267Z x0 = x[:, :D] 2025-05-07T20:31:45.1808493Z x1 = x[:, D:] 2025-05-07T20:31:45.1808703Z 2025-05-07T20:31:45.1808889Z if contiguous: 2025-05-07T20:31:45.1809124Z x0 = x0.contiguous() 2025-05-07T20:31:45.1809386Z x1 = x1.contiguous() 2025-05-07T20:31:45.1809620Z 2025-05-07T20:31:45.1809819Z if scale_ub is not None: 2025-05-07T20:31:45.1810093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.1810434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.1810744Z ) 2025-05-07T20:31:45.1810941Z else: 2025-05-07T20:31:45.1811152Z scale_ub_tensor = None 2025-05-07T20:31:45.1811399Z 2025-05-07T20:31:45.1811641Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.1819106Z op = silu_mul_quant 2025-05-07T20:31:45.1819414Z if compiled: 2025-05-07T20:31:45.1819676Z op = torch.compile(op) 2025-05-07T20:31:45.1819990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.1820268Z 2025-05-07T20:31:45.1820468Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.1820769Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.1821064Z 2025-05-07T20:31:45.1821315Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.1821663Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.1821962Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.1822287Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.1822659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.1822972Z 2025-05-07T20:31:45.1823180Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.1823600Z 2025-05-07T20:31:45.1823706Z moe/activation_test.py:126: 2025-05-07T20:31:45.1824019Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1824364Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.1824704Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.1825523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.1826295Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.1826864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.1827562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.1828267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.1829017Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.1829770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.1830434Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.1831051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.1831587Z fn() 2025-05-07T20:31:45.1832110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.1832702Z self.fn.run( 2025-05-07T20:31:45.1833186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.1833733Z kernel = self.compile( 2025-05-07T20:31:45.1834290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.1834963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.1835374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1835697Z 2025-05-07T20:31:45.1835918Z self = 2025-05-07T20:31:45.1837040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.1838478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4486d9fba0>} 2025-05-07T20:31:45.1839875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.1840948Z context = 2025-05-07T20:31:45.1841243Z 2025-05-07T20:31:45.1841426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.1841958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.1842437Z module_map=module_map) 2025-05-07T20:31:45.1842809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.1843174Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.1843441Z E ^ 2025-05-07T20:31:45.1843920Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1844387Z 2025-05-07T20:31:45.1844826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1845441Z 2025-05-07T20:31:45.1845551Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1845969Z self=, 2025-05-07T20:31:45.1846385Z T=2048, 2025-05-07T20:31:45.1846582Z D=5120, 2025-05-07T20:31:45.1846771Z scale_ub=1200.0, 2025-05-07T20:31:45.1847002Z contiguous=True, 2025-05-07T20:31:45.1847229Z compiled=False, 2025-05-07T20:31:45.1847434Z ) 2025-05-07T20:31:45.4723762Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:45.4725212Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:45.4726607Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:45.4728148Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:45.4729154Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4730506Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:45.4732027Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4733062Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4734697Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:45.4736138Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4737245Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4738579Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:45.4739895Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:45.4741171Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:45.4742429Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:45.4743291Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4744361Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:45.4745573Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:45.4746586Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:45.4747849Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:45.4749185Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:45.4750346Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:45.4751437Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:45.4752663Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:45.4754073Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:45.4755174Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4756118Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4756883Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:45.4758023Z W0507 20:31:45.468000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1344517Z self = 2025-05-07T20:31:46.1345736Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.1346026Z 2025-05-07T20:31:46.1346110Z @given( 2025-05-07T20:31:46.1346350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1346688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1346995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1347341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1347679Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1347965Z ) 2025-05-07T20:31:46.1348328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1348783Z def test_silu_mul_quant( 2025-05-07T20:31:46.1349032Z self, 2025-05-07T20:31:46.1349226Z T: int, 2025-05-07T20:31:46.1349424Z D: int, 2025-05-07T20:31:46.1349645Z scale_ub: Optional[float], 2025-05-07T20:31:46.1349916Z contiguous: bool, 2025-05-07T20:31:46.1350177Z compiled: bool, 2025-05-07T20:31:46.1350410Z ) -> None: 2025-05-07T20:31:46.1350625Z torch.manual_seed(2025) 2025-05-07T20:31:46.1350872Z 2025-05-07T20:31:46.1351156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1351498Z 2025-05-07T20:31:46.1351695Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1351990Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1352300Z x = x_sign * x_clamp 2025-05-07T20:31:46.1352545Z x0 = x[:, :D] 2025-05-07T20:31:46.1352767Z x1 = x[:, D:] 2025-05-07T20:31:46.1352975Z 2025-05-07T20:31:46.1353166Z if contiguous: 2025-05-07T20:31:46.1353404Z x0 = x0.contiguous() 2025-05-07T20:31:46.1353661Z x1 = x1.contiguous() 2025-05-07T20:31:46.1353904Z 2025-05-07T20:31:46.1354098Z if scale_ub is not None: 2025-05-07T20:31:46.1354368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1354716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1355030Z ) 2025-05-07T20:31:46.1355228Z else: 2025-05-07T20:31:46.1355438Z scale_ub_tensor = None 2025-05-07T20:31:46.1355695Z 2025-05-07T20:31:46.1356093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1356415Z op = silu_mul_quant 2025-05-07T20:31:46.1356675Z if compiled: 2025-05-07T20:31:46.1356933Z op = torch.compile(op) 2025-05-07T20:31:46.1357232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1357514Z 2025-05-07T20:31:46.1357714Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.1357881Z 2025-05-07T20:31:46.1357983Z moe/activation_test.py:117: 2025-05-07T20:31:46.1358290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1358629Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.1358915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1359632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.1360347Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.1360905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1361602Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1362290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1362839Z kernel = self.compile( 2025-05-07T20:31:46.1363397Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1364066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1364477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1364799Z 2025-05-07T20:31:46.1365018Z self = 2025-05-07T20:31:46.1366139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1367569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44857e9620>} 2025-05-07T20:31:46.1368957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1370033Z context = 2025-05-07T20:31:46.1370333Z 2025-05-07T20:31:46.1370508Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1371054Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1371544Z module_map=module_map) 2025-05-07T20:31:46.1372007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1372368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.1372624Z E ^ 2025-05-07T20:31:46.1373101Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1373565Z 2025-05-07T20:31:46.1374000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1374528Z 2025-05-07T20:31:46.1374637Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1375057Z self=, 2025-05-07T20:31:46.1375471Z T=2048, 2025-05-07T20:31:46.1375676Z D=5120, 2025-05-07T20:31:46.1375864Z scale_ub=1200.0, 2025-05-07T20:31:46.1376089Z contiguous=True, 2025-05-07T20:31:46.1376312Z compiled=True, 2025-05-07T20:31:46.1376514Z ) 2025-05-07T20:31:46.1376929Z self = 2025-05-07T20:31:46.1377439Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.1377717Z 2025-05-07T20:31:46.1377793Z @given( 2025-05-07T20:31:46.1378028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1378343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.1378654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.1378981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.1379313Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.1379615Z ) 2025-05-07T20:31:46.1379964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.1380425Z def test_silu_mul_quant( 2025-05-07T20:31:46.1380671Z self, 2025-05-07T20:31:46.1380864Z T: int, 2025-05-07T20:31:46.1381066Z D: int, 2025-05-07T20:31:46.1381294Z scale_ub: Optional[float], 2025-05-07T20:31:46.1381561Z contiguous: bool, 2025-05-07T20:31:46.1381809Z compiled: bool, 2025-05-07T20:31:46.1382036Z ) -> None: 2025-05-07T20:31:46.1382248Z torch.manual_seed(2025) 2025-05-07T20:31:46.1382496Z 2025-05-07T20:31:46.1382774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.1383118Z 2025-05-07T20:31:46.1383315Z x_sign = torch.sign(x) 2025-05-07T20:31:46.1383615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.1383933Z x = x_sign * x_clamp 2025-05-07T20:31:46.1384169Z x0 = x[:, :D] 
2025-05-07T20:31:46.1384390Z x1 = x[:, D:] 2025-05-07T20:31:46.1384601Z 2025-05-07T20:31:46.1384869Z if contiguous: 2025-05-07T20:31:46.1385101Z x0 = x0.contiguous() 2025-05-07T20:31:46.1385361Z x1 = x1.contiguous() 2025-05-07T20:31:46.1385596Z 2025-05-07T20:31:46.1385790Z if scale_ub is not None: 2025-05-07T20:31:46.1386073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.1386406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.1386721Z ) 2025-05-07T20:31:46.1386917Z else: 2025-05-07T20:31:46.1387122Z scale_ub_tensor = None 2025-05-07T20:31:46.1387375Z 2025-05-07T20:31:46.1387609Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1387923Z op = silu_mul_quant 2025-05-07T20:31:46.1388181Z if compiled: 2025-05-07T20:31:46.1388433Z op = torch.compile(op) 2025-05-07T20:31:46.1388730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.1389000Z 2025-05-07T20:31:46.1389192Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.1389485Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.1389774Z 2025-05-07T20:31:46.1390017Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.1390360Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.1390653Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.1390977Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.1391348Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.1391658Z 2025-05-07T20:31:46.1391861Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.1392067Z 2025-05-07T20:31:46.1392170Z moe/activation_test.py:126: 2025-05-07T20:31:46.1392476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1392818Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.1393153Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.1393963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.1394739Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.1395381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.1396089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.1396802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.1397541Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.1398297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.1398958Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.1399581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.1400112Z fn() 2025-05-07T20:31:46.1400640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.1401243Z self.fn.run( 2025-05-07T20:31:46.1401716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.1402262Z kernel = self.compile( 2025-05-07T20:31:46.1402818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.1403495Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.1403901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.1404142Z 2025-05-07T20:31:46.1404353Z self = 2025-05-07T20:31:46.1405472Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.1407374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448577e980>} 2025-05-07T20:31:46.1408772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.1409835Z context = 2025-05-07T20:31:46.1410136Z 2025-05-07T20:31:46.1410313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.1410851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.1411330Z module_map=module_map) 2025-05-07T20:31:46.1411702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.1412158Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.1412429Z E ^ 2025-05-07T20:31:46.1412908Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.1413377Z 2025-05-07T20:31:46.1413806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.1414338Z 2025-05-07T20:31:46.1414448Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.1414866Z self=, 2025-05-07T20:31:46.1415279Z T=16384, 2025-05-07T20:31:46.1415469Z D=7168, 2025-05-07T20:31:46.1415658Z scale_ub=1200.0, 2025-05-07T20:31:46.1415884Z contiguous=False, 2025-05-07T20:31:46.1416122Z compiled=False, 2025-05-07T20:31:46.1416328Z ) 2025-05-07T20:31:46.3239807Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:46.3242047Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:46.3244826Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:46.3247777Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:46.3248988Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3250373Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:46.3251895Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3260720Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3262019Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:46.3263681Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3264789Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3266119Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:46.3267416Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:46.3268682Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:46.3269947Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:46.3270796Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:46.3271857Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:46.3272915Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:46.3273743Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:46.3274998Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:46.3276450Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:46.3277616Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:46.3278699Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:46.3279923Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:46.3281329Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:46.3282435Z 
W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3283378Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3284144Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:46.3285186Z W0507 20:31:46.320000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:47.3271399Z self = 2025-05-07T20:31:47.3272148Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:47.3272561Z 2025-05-07T20:31:47.3272646Z @given( 2025-05-07T20:31:47.3272887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.3273206Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.3273522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.3273890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.3274232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.3274524Z ) 2025-05-07T20:31:47.3275400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.3275977Z def test_silu_mul_quant( 2025-05-07T20:31:47.3276229Z self, 2025-05-07T20:31:47.3276440Z T: int, 2025-05-07T20:31:47.3276649Z D: int, 2025-05-07T20:31:47.3276872Z scale_ub: Optional[float], 2025-05-07T20:31:47.3277158Z contiguous: bool, 2025-05-07T20:31:47.3277415Z compiled: bool, 2025-05-07T20:31:47.3277649Z ) -> None: 2025-05-07T20:31:47.3277875Z torch.manual_seed(2025) 2025-05-07T20:31:47.3278133Z 2025-05-07T20:31:47.3278436Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.3278817Z 2025-05-07T20:31:47.3279022Z x_sign = torch.sign(x) 2025-05-07T20:31:47.3279325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.3279650Z x = x_sign * x_clamp 2025-05-07T20:31:47.3279901Z x0 = x[:, :D] 2025-05-07T20:31:47.3280129Z x1 = x[:, D:] 2025-05-07T20:31:47.3280339Z 2025-05-07T20:31:47.3280539Z if contiguous: 2025-05-07T20:31:47.3280811Z x0 = x0.contiguous() 2025-05-07T20:31:47.3281081Z x1 = x1.contiguous() 2025-05-07T20:31:47.3281320Z 2025-05-07T20:31:47.3281523Z if scale_ub is not None: 2025-05-07T20:31:47.3281806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.3282152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.3282466Z ) 2025-05-07T20:31:47.3282668Z else: 2025-05-07T20:31:47.3282888Z scale_ub_tensor = None 2025-05-07T20:31:47.3283145Z 2025-05-07T20:31:47.3283383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.3283713Z op = silu_mul_quant 2025-05-07T20:31:47.3284142Z if compiled: 2025-05-07T20:31:47.3284399Z op = torch.compile(op) 2025-05-07T20:31:47.3284704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3284983Z 2025-05-07T20:31:47.3285190Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.3285363Z 2025-05-07T20:31:47.3285474Z moe/activation_test.py:117: 2025-05-07T20:31:47.3285778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.3286122Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.3286413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3287141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.3287860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.3288427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.3289200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.3289890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.3290456Z kernel = self.compile( 2025-05-07T20:31:47.3291019Z 
2025-05-07T20:31:47.3291019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:47.3291704Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:47.3292205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:47.3292680Z self =
2025-05-07T20:31:47.3293812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:47.3295265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44849e1620>}
2025-05-07T20:31:47.3296751Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:47.3297812Z context =
2025-05-07T20:31:47.3298293Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:47.3298859Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:47.3299378Z                            module_map=module_map)
2025-05-07T20:31:47.3299751Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:47.3300120Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:47.3300396Z E       ^
2025-05-07T20:31:47.3300870Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:47.3301783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
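Every failure in this job reduces to one root cause: Triton's fp8e4nv (float8 e4m3) type is not available on this runner's GPU. The job ran on a linux.g5.4xlarge.nvidia.gpu instance, whose A10G reports compute capability 8.6, while Triton's fp8e4nv lowering is, to the best of my knowledge, only available on Ada/Hopper-class parts (compute capability 8.9 and up). A minimal sketch of a guard a test suite could use to skip these cases on unsupported hardware; the helper name and the 8.9 threshold are assumptions on my part, not FBGEMM or Triton API:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv lowering needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on this g5.4xlarge runner reports 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test class:
#
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...) -> None: ...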
2025-05-07T20:31:47.3302432Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:47.3302871Z     self=,
2025-05-07T20:31:47.3303286Z     T=1,
2025-05-07T20:31:47.3303480Z     D=7168,
2025-05-07T20:31:47.3303686Z     scale_ub=None,
2025-05-07T20:31:47.3303908Z     contiguous=True,
2025-05-07T20:31:47.3304141Z     compiled=True,
2025-05-07T20:31:47.3304355Z )
2025-05-07T20:31:47.3304682Z self =
2025-05-07T20:31:47.3305182Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
[test source identical to the listing above, up to and including the definition of fn(); this example gets past fn() and fails in the reference path:]
2025-05-07T20:31:47.3317711Z         y_fp8, y_scale = fn()
2025-05-07T20:31:47.3318006Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:47.3318314Z
2025-05-07T20:31:47.3318554Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:47.3318900Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:47.3319210Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:47.3319533Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:47.3319913Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:47.3320238Z
2025-05-07T20:31:47.3320444Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:47.3320757Z moe/activation_test.py:126:
2025-05-07T20:31:47.3321066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:47.3321416Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:47.3321748Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:47.3322568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:47.3323356Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:47.3323924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:47.3324624Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:47.3325467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:47.3326220Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:47.3326972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:47.3327638Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:47.3328262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:47.3328848Z     fn()
2025-05-07T20:31:47.3329370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:47.3329978Z     self.fn.run(
2025-05-07T20:31:47.3330462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:47.3331019Z     kernel = self.compile(
2025-05-07T20:31:47.3331571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:47.3332334Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:47.3332751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:47.3333205Z self =
2025-05-07T20:31:47.3334333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:47.3335764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4486da8b80>}
2025-05-07T20:31:47.3337258Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:47.3338322Z context =
2025-05-07T20:31:47.3338836Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:47.3339382Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:47.3339870Z                            module_map=module_map)
2025-05-07T20:31:47.3340242Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:47.3340615Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:47.3340894Z E       ^
2025-05-07T20:31:47.3341383Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:47.3342287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
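Note that this example got past fn() and failed in the test's own reference path: triton_quantize_fp8_row launches another Triton kernel (_kernel_quantize_fp8_row), which trips over the same fp8e4nv limitation. For checking the numerics on a GPU without fp8e4nv support, a rough pure-PyTorch stand-in for ref_fn could look like the following; the per-row scaling convention is inferred from the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]) and is an assumption, not FBGEMM's actual implementation:

from typing import Optional, Tuple

import torch


def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, then per-row quantization to float8 e4m3,
    # treating y_scale as the per-row dequantization scale.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    y_scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale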
2025-05-07T20:31:47.3342941Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:47.3343372Z     self=,
2025-05-07T20:31:47.3343793Z     T=4096,
2025-05-07T20:31:47.3343983Z     D=5120,
2025-05-07T20:31:47.3344182Z     scale_ub=None,
2025-05-07T20:31:47.3344407Z     contiguous=False,
2025-05-07T20:31:47.3344636Z     compiled=False,
2025-05-07T20:31:47.3344846Z )
[W0507 20:31:47.619000 through 20:31:48.118000 96677 triton_kernel_wrap.py:752: the identify_mutated_tensors warning traceback shown above under [1/2] is emitted four more times here with counter [1/3], each ending in the same fp8e4nv CompilationError for _fbgemm_silu_mul_quant]
[each example below reprints the full test source and traceback in the raw log; all fail with the identical triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails at moe/activation_test.py:117 in fn(), compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails at moe/activation_test.py:117 in fn(), compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails at moe/activation_test.py:126 in ref_fn(), compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row
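Hypothesis keeps drawing fresh parameter combinations even though every draw dies in the same compile step. When debugging a failure like this, pinning one combination with Hypothesis's @example decorator makes it replay deterministically before any random exploration; a sketch on a stand-in test, not the FBGEMM suite itself:

from hypothesis import example, given, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=16384, D=7168)  # one of the failing shapes from this log
def test_replays_failing_shape(T: int, D: int) -> None:
    # Placeholder body; the real test would launch silu_mul_quant here.
    assert T > 0 and D > 0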
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5413431Z 2025-05-07T20:31:49.5413866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5414394Z 2025-05-07T20:31:49.5414503Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5414927Z self=, 2025-05-07T20:31:49.5415341Z T=128, 2025-05-07T20:31:49.5415523Z D=7168, 2025-05-07T20:31:49.5415718Z scale_ub=None, 2025-05-07T20:31:49.5415946Z contiguous=False, 2025-05-07T20:31:49.5416176Z compiled=False, 2025-05-07T20:31:49.5416378Z ) 2025-05-07T20:31:49.7432555Z self = 2025-05-07T20:31:49.7433345Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.7433715Z 2025-05-07T20:31:49.7433795Z @given( 2025-05-07T20:31:49.7434033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.7434755Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.7435066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.7435403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.7435737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.7436024Z ) 2025-05-07T20:31:49.7436382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.7436839Z def test_silu_mul_quant( 2025-05-07T20:31:49.7437081Z self, 2025-05-07T20:31:49.7437282Z T: int, 2025-05-07T20:31:49.7437484Z D: int, 2025-05-07T20:31:49.7437699Z scale_ub: Optional[float], 2025-05-07T20:31:49.7437974Z contiguous: bool, 2025-05-07T20:31:49.7438220Z compiled: bool, 2025-05-07T20:31:49.7438445Z ) -> None: 2025-05-07T20:31:49.7438666Z torch.manual_seed(2025) 2025-05-07T20:31:49.7438913Z 2025-05-07T20:31:49.7439183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.7439586Z 2025-05-07T20:31:49.7439788Z x_sign = torch.sign(x) 2025-05-07T20:31:49.7440080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.7440399Z x = x_sign * x_clamp 2025-05-07T20:31:49.7440653Z x0 = x[:, :D] 2025-05-07T20:31:49.7440867Z x1 = x[:, D:] 2025-05-07T20:31:49.7441078Z 2025-05-07T20:31:49.7441265Z if contiguous: 2025-05-07T20:31:49.7441494Z x0 = x0.contiguous() 2025-05-07T20:31:49.7441756Z x1 = x1.contiguous() 2025-05-07T20:31:49.7442001Z 2025-05-07T20:31:49.7442185Z if scale_ub is not None: 2025-05-07T20:31:49.7442461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.7442806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.7443121Z ) 2025-05-07T20:31:49.7443311Z else: 2025-05-07T20:31:49.7443524Z scale_ub_tensor = None 2025-05-07T20:31:49.7443782Z 2025-05-07T20:31:49.7444016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.7444340Z op = silu_mul_quant 2025-05-07T20:31:49.7444600Z if compiled: 2025-05-07T20:31:49.7444847Z op = torch.compile(op) 2025-05-07T20:31:49.7445305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7445590Z 2025-05-07T20:31:49.7445780Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.7445981Z 2025-05-07T20:31:49.7446083Z moe/activation_test.py:117: 2025-05-07T20:31:49.7446393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.7454951Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.7455286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7456015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.7456730Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.7457281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.7457998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.7458694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.7459242Z kernel = self.compile( 2025-05-07T20:31:49.7459811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.7460494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.7460912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.7461152Z 2025-05-07T20:31:49.7461366Z self = 2025-05-07T20:31:49.7462494Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.7464058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477844cc0>} 2025-05-07T20:31:49.7465453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.7466521Z context = 2025-05-07T20:31:49.7466818Z 2025-05-07T20:31:49.7466991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.7467534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.7468017Z module_map=module_map) 2025-05-07T20:31:49.7468393Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.7468755Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.7469019Z E ^ 2025-05-07T20:31:49.7469505Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.7469981Z 2025-05-07T20:31:49.7470412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.7470944Z 2025-05-07T20:31:49.7471057Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.7471478Z self=, 2025-05-07T20:31:49.7471894Z T=4096, 2025-05-07T20:31:49.7472093Z D=5120, 2025-05-07T20:31:49.7472287Z scale_ub=1200.0, 2025-05-07T20:31:49.7472518Z contiguous=True, 2025-05-07T20:31:49.7472745Z compiled=False, 2025-05-07T20:31:49.7472953Z ) 2025-05-07T20:31:49.7473286Z self = 2025-05-07T20:31:49.7473807Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.7474091Z 2025-05-07T20:31:49.7474177Z @given( 2025-05-07T20:31:49.7474494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.7474819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.7475137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.7475476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.7475820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.7476115Z ) 2025-05-07T20:31:49.7476469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.7476925Z def test_silu_mul_quant( 2025-05-07T20:31:49.7477175Z self, 2025-05-07T20:31:49.7477374Z T: int, 2025-05-07T20:31:49.7477570Z D: int, 2025-05-07T20:31:49.7477794Z scale_ub: Optional[float], 2025-05-07T20:31:49.7478075Z contiguous: bool, 2025-05-07T20:31:49.7478313Z compiled: bool, 2025-05-07T20:31:49.7478549Z ) -> None: 2025-05-07T20:31:49.7478766Z torch.manual_seed(2025) 2025-05-07T20:31:49.7479006Z 2025-05-07T20:31:49.7479290Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.7479640Z 2025-05-07T20:31:49.7479828Z x_sign = torch.sign(x) 2025-05-07T20:31:49.7480131Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.7480445Z x = x_sign * x_clamp 2025-05-07T20:31:49.7480683Z x0 = x[:, :D] 2025-05-07T20:31:49.7480903Z x1 = x[:, D:] 2025-05-07T20:31:49.7481113Z 2025-05-07T20:31:49.7481296Z if contiguous: 2025-05-07T20:31:49.7481530Z x0 = x0.contiguous() 2025-05-07T20:31:49.7481796Z x1 = x1.contiguous() 2025-05-07T20:31:49.7482031Z 2025-05-07T20:31:49.7482226Z if scale_ub is not None: 2025-05-07T20:31:49.7482593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.7482930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.7483239Z ) 2025-05-07T20:31:49.7483433Z else: 2025-05-07T20:31:49.7483655Z scale_ub_tensor = None 2025-05-07T20:31:49.7483904Z 2025-05-07T20:31:49.7484140Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.7484465Z op = silu_mul_quant 2025-05-07T20:31:49.7484717Z if compiled: 2025-05-07T20:31:49.7484968Z op = torch.compile(op) 2025-05-07T20:31:49.7485274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7485547Z 2025-05-07T20:31:49.7485746Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.7485915Z 2025-05-07T20:31:49.7486024Z moe/activation_test.py:117: 2025-05-07T20:31:49.7486327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.7486674Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.7486968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.7487688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.7488403Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:49.7488963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:49.7489721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.7490412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.7490960Z kernel = self.compile(
2025-05-07T20:31:49.7491520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.7492290Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.7492701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:49.7492941Z
2025-05-07T20:31:49.7493153Z self =
2025-05-07T20:31:49.7494363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.7495793Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477844e00>}
2025-05-07T20:31:49.7497192Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.7498253Z context =
2025-05-07T20:31:49.7498563Z
2025-05-07T20:31:49.7498733Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.7499296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.7499814Z module_map=module_map)
2025-05-07T20:31:49.7500180Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.7500544Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:49.7500810Z E ^
2025-05-07T20:31:49.7501285Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
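Note: every failing example above stops at the same point, before the kernel ever runs. The kernel requests the fp8e4nv element type (Triton's name for torch.float8_e4m3fn), which Triton only compiles on NVIDIA GPUs with compute capability 8.9 or newer; the g5.4xlarge runner carries an A10G, which reports capability (8, 6) and therefore only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of the kind of capability guard that would explain this log; the helper name is ours, not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; Triton can only lower it
        # on compute capability >= (8, 9) (Ada/Hopper). The A10G on this
        # runner reports (8, 6), so the guard returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Under that assumption, both _fbgemm_silu_mul_quant and the quantization kernel seen later in this log fail to compile on this runner regardless of which Hypothesis example is tried.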
2025-05-07T20:31:49.7501761Z
2025-05-07T20:31:49.7502194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.7502731Z
2025-05-07T20:31:49.7502834Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:49.7503259Z self=,
2025-05-07T20:31:49.7504196Z T=1,
2025-05-07T20:31:49.7504386Z D=5120,
2025-05-07T20:31:49.7504581Z scale_ub=None,
2025-05-07T20:31:49.7504792Z contiguous=True,
2025-05-07T20:31:49.7505026Z compiled=True,
2025-05-07T20:31:49.7505234Z )
2025-05-07T20:31:49.9859106Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:49.9860456Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:31:49.9861860Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:49.9863369Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:49.9864419Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:49.9865792Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:49.9867252Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.9868290Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:49.9869856Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:49.9871586Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.9872914Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:49.9874517Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:49.9876073Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse())
2025-05-07T20:31:49.9877609Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.9879110Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:49.9880127Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.9881398Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:49.9882663Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:49.9883788Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:49.9885308Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.9886908Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.9888296Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:49.9889587Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:49.9891064Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.9892705Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.9893826Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9894796Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9895580Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:49.9896659Z W0507 20:31:49.982000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The identify_mutated_tensors warning above repeats verbatim three more times (W0507 20:31:50.054000, 20:31:50.261000, 20:31:50.272000); only the tail of the last repetition is kept below.]
2025-05-07T20:31:50.2786334Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.2787274Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:50.2788041Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
2025-05-07T20:31:50.2789099Z W0507 20:31:50.272000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.5000885Z self = 2025-05-07T20:31:50.5001654Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:50.5002028Z 2025-05-07T20:31:50.5002139Z @given( 2025-05-07T20:31:50.5002723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.5003051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.5003359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.5003713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.5004057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.5004344Z ) 2025-05-07T20:31:50.5004707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.5005173Z def test_silu_mul_quant( 2025-05-07T20:31:50.5005423Z self, 2025-05-07T20:31:50.5005631Z T: int, 2025-05-07T20:31:50.5005841Z D: int, 2025-05-07T20:31:50.5006066Z scale_ub: Optional[float], 2025-05-07T20:31:50.5006613Z contiguous: bool, 2025-05-07T20:31:50.5006866Z compiled: bool, 2025-05-07T20:31:50.5007100Z ) -> None: 2025-05-07T20:31:50.5007332Z torch.manual_seed(2025) 2025-05-07T20:31:50.5007591Z 2025-05-07T20:31:50.5007873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.5008224Z 2025-05-07T20:31:50.5008427Z x_sign = torch.sign(x) 2025-05-07T20:31:50.5008734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.5009049Z x = x_sign * x_clamp 2025-05-07T20:31:50.5009297Z x0 = x[:, :D] 2025-05-07T20:31:50.5009522Z x1 = x[:, D:] 2025-05-07T20:31:50.5009732Z 2025-05-07T20:31:50.5009925Z if contiguous: 2025-05-07T20:31:50.5010165Z x0 = x0.contiguous() 2025-05-07T20:31:50.5010431Z x1 = x1.contiguous() 2025-05-07T20:31:50.5010680Z 2025-05-07T20:31:50.5010886Z if scale_ub is not None: 2025-05-07T20:31:50.5011162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.5011510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.5011903Z ) 2025-05-07T20:31:50.5012102Z else: 2025-05-07T20:31:50.5012325Z scale_ub_tensor = None 2025-05-07T20:31:50.5012587Z 2025-05-07T20:31:50.5012821Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.5013148Z op = silu_mul_quant 2025-05-07T20:31:50.5013628Z if compiled: 2025-05-07T20:31:50.5013905Z op = torch.compile(op) 2025-05-07T20:31:50.5014231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.5014543Z 2025-05-07T20:31:50.5014751Z y_fp8, y_scale = fn() 2025-05-07T20:31:50.5015061Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:50.5015389Z 2025-05-07T20:31:50.5015650Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.5016029Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:50.5016362Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:50.5016719Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:50.5017123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.5017485Z 2025-05-07T20:31:50.5017708Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:50.5017930Z 2025-05-07T20:31:50.5018048Z moe/activation_test.py:126: 2025-05-07T20:31:50.5018386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.5018779Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:50.5019151Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.5020157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:50.5021082Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:50.5021734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:50.5022562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:50.5023392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:50.5024397Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:50.5025295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:50.5026068Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:50.5026782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:50.5027400Z fn()
2025-05-07T20:31:50.5028003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:50.5028699Z self.fn.run(
2025-05-07T20:31:50.5029248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:50.5029934Z kernel = self.compile(
2025-05-07T20:31:50.5030586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:50.5031367Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:50.5031831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:50.5032101Z
2025-05-07T20:31:50.5032343Z self =
2025-05-07T20:31:50.5033676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:50.5035450Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477846fc0>}
2025-05-07T20:31:50.5036851Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:50.5038034Z context =
2025-05-07T20:31:50.5038338Z
2025-05-07T20:31:50.5038520Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:50.5039065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.5039566Z module_map=module_map)
2025-05-07T20:31:50.5040002Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:50.5040370Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:50.5040651Z E ^
2025-05-07T20:31:50.5041143Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
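Note: the reference path fails for the same reason, one level deeper. triton_quantize_fp8_row launches _kernel_quantize_fp8_row through Triton's autotuner, so the CompilationError surfaces from do_bench while candidate configs are being benchmarked, rather than from the direct jit.py call seen earlier. Since both the op under test and the reference quantizer target fp8e4nv, the test as a whole can only pass on SM 8.9+ hardware. A hedged sketch of how such a test is typically gated, reusing the capability helper sketched above; the class name and decorator placement are illustrative, not the actual test file:

    import unittest
    import torch

    def _sm89_or_newer() -> bool:
        # Same capability check as in the earlier sketch.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTest(unittest.TestCase):  # stand-in for the real test class
        @unittest.skipIf(not _sm89_or_newer(), "fp8e4nv (float8_e4m3fn) requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the log

With such a guard, runners like this one would report the test as skipped instead of repeatedly compiling and failing for every Hypothesis example.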
2025-05-07T20:31:50.5041622Z
2025-05-07T20:31:50.5042069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:50.5042624Z
2025-05-07T20:31:50.5042730Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:50.5043175Z self=,
2025-05-07T20:31:50.5043601Z T=2048,
2025-05-07T20:31:50.5043802Z D=5120,
2025-05-07T20:31:50.5043999Z scale_ub=None,
2025-05-07T20:31:50.5044232Z contiguous=True,
2025-05-07T20:31:50.5044468Z compiled=True,
2025-05-07T20:31:50.5044680Z )
2025-05-07T20:31:50.7328816Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:50.7330782Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last):
2025-05-07T20:31:50.7333678Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:50.7337163Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:50.7339164Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:50.7340714Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:50.7342151Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:50.7343184Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:50.7344453Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:50.7345877Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:50.7346980Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:50.7348310Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:50.7349757Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse())
2025-05-07T20:31:50.7351036Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.7352292Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:50.7353158Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.7354227Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:50.7355303Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:50.7356126Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:50.7357388Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.7358723Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.7359885Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:50.7361093Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:50.7362335Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.7363758Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.7364874Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.7365825Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.7374077Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:50.7375184Z W0507 20:31:50.729000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The identify_mutated_tensors warning above repeats verbatim three more times (W0507 20:31:50.800000, 20:31:51.007000, 20:31:51.018000); only the tail of the last repetition is kept below.]
2025-05-07T20:31:51.0240193Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:51.0241146Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:51.0241925Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^
2025-05-07T20:31:51.0242992Z W0507 20:31:51.018000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.2336784Z self = 2025-05-07T20:31:51.2337314Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:51.2337620Z 2025-05-07T20:31:51.2337711Z @given( 2025-05-07T20:31:51.2337959Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:51.2338279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:51.2338587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:51.2338924Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:51.2339257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:51.2339537Z ) 2025-05-07T20:31:51.2339891Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:51.2340349Z def test_silu_mul_quant( 2025-05-07T20:31:51.2340592Z self, 2025-05-07T20:31:51.2340790Z T: int, 2025-05-07T20:31:51.2340997Z D: int, 2025-05-07T20:31:51.2341392Z scale_ub: Optional[float], 2025-05-07T20:31:51.2341657Z contiguous: bool, 2025-05-07T20:31:51.2341895Z compiled: bool, 2025-05-07T20:31:51.2342131Z ) -> None: 2025-05-07T20:31:51.2342344Z torch.manual_seed(2025) 2025-05-07T20:31:51.2342594Z 2025-05-07T20:31:51.2342873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:51.2343215Z 2025-05-07T20:31:51.2343409Z x_sign = torch.sign(x) 2025-05-07T20:31:51.2343705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:51.2344014Z x = x_sign * x_clamp 2025-05-07T20:31:51.2344257Z x0 = x[:, :D] 2025-05-07T20:31:51.2344476Z x1 = x[:, D:] 2025-05-07T20:31:51.2344682Z 2025-05-07T20:31:51.2344868Z if contiguous: 2025-05-07T20:31:51.2345103Z x0 = x0.contiguous() 2025-05-07T20:31:51.2345361Z x1 = x1.contiguous() 2025-05-07T20:31:51.2345607Z 2025-05-07T20:31:51.2345800Z if scale_ub is not None: 2025-05-07T20:31:51.2346084Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:51.2346421Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:51.2346768Z ) 2025-05-07T20:31:51.2346967Z else: 2025-05-07T20:31:51.2347187Z scale_ub_tensor = None 2025-05-07T20:31:51.2347440Z 2025-05-07T20:31:51.2347680Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.2348004Z op = silu_mul_quant 2025-05-07T20:31:51.2348252Z if compiled: 2025-05-07T20:31:51.2348502Z op = torch.compile(op) 2025-05-07T20:31:51.2348807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.2349084Z 2025-05-07T20:31:51.2349283Z y_fp8, y_scale = fn() 2025-05-07T20:31:51.2349576Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:51.2349866Z 2025-05-07T20:31:51.2350135Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.2350511Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:51.2350805Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:51.2351129Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:51.2351680Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.2352001Z 2025-05-07T20:31:51.2352203Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:51.2352406Z 2025-05-07T20:31:51.2352507Z moe/activation_test.py:126: 2025-05-07T20:31:51.2352811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.2353151Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:51.2353488Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.2354308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:51.2355088Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:51.2355652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:51.2356356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:51.2357075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:51.2357816Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:51.2358572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:51.2359231Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:51.2359853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:51.2360383Z fn() 2025-05-07T20:31:51.2360904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:51.2361590Z self.fn.run( 2025-05-07T20:31:51.2362074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:51.2362621Z kernel = self.compile( 2025-05-07T20:31:51.2363177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:51.2363853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.2364255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.2364497Z 2025-05-07T20:31:51.2364708Z self = 2025-05-07T20:31:51.2365833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:51.2367268Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44777f54e0>} 2025-05-07T20:31:51.2368671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:51.2369727Z context = 2025-05-07T20:31:51.2370032Z 2025-05-07T20:31:51.2370242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:51.2370799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.2371282Z module_map=module_map) 2025-05-07T20:31:51.2371655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.2372094Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:51.2372375Z E ^ 2025-05-07T20:31:51.2372857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:51.2373841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:51.2374480Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:51.2374911Z self=,
2025-05-07T20:31:51.2375330Z T=128,
2025-05-07T20:31:51.2375517Z D=5120,
2025-05-07T20:31:51.2375717Z scale_ub=None,
2025-05-07T20:31:51.2375940Z contiguous=True,
2025-05-07T20:31:51.2376162Z compiled=True,
2025-05-07T20:31:51.2376372Z )
[the W0507 "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" warning and its identical _fbgemm_silu_mul_quant CompilationError traceback are emitted 4 more times here, tagged [1/6]]
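Root-cause note: Triton's fp8e4nv is the NVIDIA e4m3 float8 type (torch.float8_e4m3fn), which Triton only lowers on GPUs with compute capability 8.9 or newer; the supported list in the error, ('fp8e4b15', 'fp8e5'), is what pre-SM89 Ampere-class hardware offers, so every compile of _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row on this GPU fails the same way. A minimal guard sketch, not code from this repository (class name is illustrative), that would skip these tests on such hardware:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv == torch.float8_e4m3fn; Triton lowers it only on SM 8.9+.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):
        ...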
2025-05-07T20:31:52.0155339Z self = 
2025-05-07T20:31:52.0156438Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test body and failure identical to the T = 2048 example above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row fails to compile]
2025-05-07T20:31:52.0200397Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:52.0200804Z E def _kernel_quantize_fp8_row(
2025-05-07T20:31:52.0201077Z E ^
2025-05-07T20:31:52.0201555Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:52.0202449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
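Note that the exception is raised inside ref_fn, the eager reference path, so it does not depend on torch.compile; compiled=False draws would fail identically. A hedged repro sketch, using only the import path and call shape already visible in the traceback:

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On a pre-SM89 GPU this raises the CompilationError above while Triton
    # compiles _kernel_quantize_fp8_row; no torch.compile is involved.
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)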
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.0202019Z 2025-05-07T20:31:52.0202449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:52.0202986Z 2025-05-07T20:31:52.0203091Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:52.0203519Z self=, 2025-05-07T20:31:52.0203937Z T=4096, 2025-05-07T20:31:52.0204125Z D=5120, 2025-05-07T20:31:52.0204328Z scale_ub=None, 2025-05-07T20:31:52.0204551Z contiguous=True, 2025-05-07T20:31:52.0204772Z compiled=True, 2025-05-07T20:31:52.0204983Z ) 2025-05-07T20:31:52.2631617Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.2632733Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.2634115Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.2635586Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.2636595Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2637958Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.2639405Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.2640423Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2641700Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.2643268Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.2644378Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2645718Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.2647014Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.2648287Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.2649562Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.2650478Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.2651545Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.2652667Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.2653497Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.2654847Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.2656188Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.2657359Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.2658443Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.2659677Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.2661155Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.2662268Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.2663210Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.2663982Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.2665043Z W0507 20:31:52.260000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.3344735Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.3346001Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.3347377Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.3348855Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.3349867Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3351279Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.3352714Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.3353731Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3354996Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.3356431Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.3357684Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3359014Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.3360360Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.3361623Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.3362892Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.3363758Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.3364825Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.3365879Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.3366702Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.3367963Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.3369414Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.3370625Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.3371702Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.3373009Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.3374426Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.3375539Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.3376484Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.3377256Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.3378312Z W0507 20:31:52.331000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.5438035Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.5439164Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.5441071Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.5444037Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.5446060Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5448770Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.5450891Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.5451983Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5453268Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.5454711Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.5455959Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5457306Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.5458614Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.5459897Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.5461216Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.5462087Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5463169Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.5464238Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.5465072Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.5466339Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.5467680Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.5468934Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.5470035Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.5471272Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.5472699Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.5473815Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.5474778Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.5475555Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.5476626Z W0507 20:31:52.540000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.5534819Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:52.5535930Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:52.5537486Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:52.5538972Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:52.5539988Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5541398Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:52.5542853Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.5543889Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5545175Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:52.5546619Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.5547733Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5549190Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:52.5550509Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:52.5551839Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:52.5553108Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:52.5553977Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:52.5555060Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:31:52.5556130Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:52.5556963Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:52.5558234Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:52.5559575Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:52.5560885Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:31:52.5561983Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:52.5563225Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:52.5564653Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:52.5565765Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.5566728Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.5567508Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:52.5568579Z W0507 20:31:52.550000 96677 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:52.8091874Z self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:52.8129879Z Trying example: test_silu_mul_quant(self=..., T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:52.8431044Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:52.8433648Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:52.8436442Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:52.8438522Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:52.8440888Z W0507 20:31:52.841000 96677 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[... test body and traceback identical to the first failure above: CompilationError ("type fp8e4nv not supported in this architecture") while compiling _kernel_quantize_fp8_row from ref_fn ...]
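The recompile limit is hit because the hypothesis examples alternate the `contiguous` flag, and `x0 = x[:, :D]` has a different row stride depending on whether `.contiguous()` was applied; Dynamo guards compiled code on input strides. A minimal illustration of exactly that guard mismatch (CPU tensors for convenience, sizes taken from the failing example; not FBGEMM code):

    import torch

    T, D = 128, 5120
    x = torch.randn(T, 2 * D)

    x0_view = x[:, :D]               # a view into x: row stride is 2 * D
    x0_dense = x0_view.contiguous()  # a fresh dense copy: row stride is D

    print(x0_view.stride())   # (10240, 1) -> the "actual 10240" in the guard failure
    print(x0_dense.stride())  # (5120, 1)  -> the "expected 5120"

    # torch.compile specializes on strides, so alternating these two layouts
    # (as the contiguous=True/False examples do) forces a recompile each time
    # until torch._dynamo hits recompile_limit (8).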
2025-05-07T20:31:52.9361140Z Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() via torch.compile ...]
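Each "Trying example" line is hypothesis echoing the drawn parameters before running them, a consequence of `verbosity=Verbosity.verbose` in the `@settings` above; the examples are drawn from the Cartesian space of the `sampled_from` strategies. A standalone toy with the same mechanics (not the FBGEMM test):

    from hypothesis import Verbosity, given, settings, strategies as st

    @settings(verbosity=Verbosity.verbose, max_examples=8, deadline=None)
    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    def test_demo(T: int, contiguous: bool) -> None:
        # With Verbosity.verbose, each attempt is printed as
        # "Trying example: test_demo(T=..., contiguous=...)" before it runs.
        assert T > 0

    test_demo()  # a @given-wrapped test is directly callable and runs the search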
2025-05-07T20:31:53.0769853Z Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[... same failure: fn() returned, then CompilationError while compiling _kernel_quantize_fp8_row from ref_fn ...]
2025-05-07T20:31:53.1417398Z Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
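Note that the compiled=False examples fail identically: Triton kernels are JIT-compiled at first launch, so the eager path reaches the same fp8e4nv codegen error as the torch.compile path. A minimal sketch of that launch-time compilation (a generic copy kernel, assuming a CUDA machine with triton installed; not the FBGEMM kernel):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    # Compilation happens here, at the first [grid](...) launch -- eager or
    # compiled alike -- which is why compiled=False hits the same CompilationError.
    _copy_kernel[(triton.cdiv(x.numel(), 256),)](x, y, x.numel(), BLOCK=256)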
2025-05-07T20:31:53.2939898Z Trying example: test_silu_mul_quant(self=..., T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() via torch.compile ...]
2025-05-07T20:31:53.2973107Z Trying example: test_silu_mul_quant(self=..., T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
2025-05-07T20:31:53.4138376Z Trying example: test_silu_mul_quant(self=..., T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
2025-05-07T20:31:53.4170010Z Trying example: test_silu_mul_quant(self=..., T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() on the eager path ...]
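For reference, the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row returns a per-row scale such that y is approximately y_fp8 * scale. A pure-PyTorch sketch of that row-wise scheme (assumed semantics for illustration only, not FBGEMM's kernel; requires a PyTorch build with float8 dtypes):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise fp8 quantization, assumed semantics: scale each row so its
        # max |value| maps onto the e4m3 representable maximum (448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)  # avoid div-by-zero
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max / fp8_max
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize with y_fp8.float() * scale[:, None]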
2025-05-07T20:31:53.8011707Z Trying example: test_silu_mul_quant(self=..., T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... same failure: CompilationError while compiling _fbgemm_silu_mul_quant from fn() via torch.compile ...]
2025-05-07T20:31:53.8029660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.8030376Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.8031025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.8031730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.8032426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.8032978Z kernel = self.compile( 2025-05-07T20:31:53.8033540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.8034216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.8034628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.8034864Z 2025-05-07T20:31:53.8035083Z self = 2025-05-07T20:31:53.8036207Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.8037644Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e7aca0>} 2025-05-07T20:31:53.8039051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.8040118Z context = 2025-05-07T20:31:53.8040414Z 2025-05-07T20:31:53.8040591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.8041126Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.8041612Z module_map=module_map) 2025-05-07T20:31:53.8041994Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.8042360Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.8042626Z E ^ 2025-05-07T20:31:53.8043188Z E ValueError("type fp8e4nv not supported in this architecture. 
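Note on the failure: on this Triton build, 'fp8e4nv' is the NVIDIA e4m3 float8 (torch.float8_e4m3fn), and Triton can only lower it on GPUs with compute capability 8.9 or newer (Ada/Hopper); older parts are offered only 'fp8e4b15' and 'fp8e5' (e5m2), which is exactly what the ValueError reports. A minimal sketch of a capability gate a caller could apply before choosing an fp8 dtype (pick_fp8_dtype is a hypothetical helper, not an FBGEMM API):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # fp8e4nv (torch.float8_e4m3fn) only compiles on compute capability
    # >= 8.9 (Ada/Hopper); earlier GPUs fall back to e5m2 ('fp8e5').
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2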
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) with the same CompilationError from _fbgemm_silu_mul_quant.

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  Here fn() itself returns, and the failure moves one step later, into the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
    (same Triton compile frames as above, with num_stages=2 in CUDAOptions)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
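For reference, the dequant step in the test (y_fp8.to(torch.float32) * y_scale[:, None]) pins down what triton_quantize_fp8_row must produce: one scale per row, mapping that row's max |value| (optionally clamped by scale_ub) onto the fp8 range. A plain-PyTorch sketch of those assumed semantics (quantize_fp8_row_ref is our name; the real kernel lives in fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py):

import torch
from typing import Optional, Tuple

def quantize_fp8_row_ref(
    y: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    dtype: torch.dtype = torch.float8_e5m2,  # arch-safe stand-in for e4m3
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: per-row scale so the row's max |value|
    # (optionally clamped by scale_ub) lands on the fp8 max, then cast.
    fp8_max = torch.finfo(dtype).max
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(dtype)
    return y_fp8, y_scale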
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.0370912Z 2025-05-07T20:31:54.0371347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.0371951Z 2025-05-07T20:31:54.0372067Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.0372494Z self=, 2025-05-07T20:31:54.0372916Z T=1, 2025-05-07T20:31:54.0373111Z D=5120, 2025-05-07T20:31:54.0373305Z scale_ub=1200.0, 2025-05-07T20:31:54.0373544Z contiguous=False, 2025-05-07T20:31:54.0373783Z compiled=True, 2025-05-07T20:31:54.0373991Z ) 2025-05-07T20:31:54.1878171Z self = 2025-05-07T20:31:54.1878926Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:54.1879316Z 2025-05-07T20:31:54.1879437Z @given( 2025-05-07T20:31:54.1879689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.1880009Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.1880316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.1880670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.1881030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.1881346Z ) 2025-05-07T20:31:54.1881698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.1882139Z def test_silu_mul_quant( 2025-05-07T20:31:54.1882736Z self, 2025-05-07T20:31:54.1882936Z T: int, 2025-05-07T20:31:54.1883125Z D: int, 2025-05-07T20:31:54.1883339Z scale_ub: Optional[float], 2025-05-07T20:31:54.1883607Z contiguous: bool, 2025-05-07T20:31:54.1883838Z compiled: bool, 2025-05-07T20:31:54.1884074Z ) -> None: 2025-05-07T20:31:54.1884288Z torch.manual_seed(2025) 2025-05-07T20:31:54.1884524Z 2025-05-07T20:31:54.1884801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.1885144Z 2025-05-07T20:31:54.1885338Z x_sign = torch.sign(x) 2025-05-07T20:31:54.1885632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.1885975Z x = x_sign * x_clamp 2025-05-07T20:31:54.1886229Z x0 = x[:, :D] 2025-05-07T20:31:54.1886451Z x1 = x[:, D:] 2025-05-07T20:31:54.1886656Z 2025-05-07T20:31:54.1886849Z if contiguous: 2025-05-07T20:31:54.1887089Z x0 = x0.contiguous() 2025-05-07T20:31:54.1887358Z x1 = x1.contiguous() 2025-05-07T20:31:54.1887604Z 2025-05-07T20:31:54.1887803Z if scale_ub is not None: 2025-05-07T20:31:54.1888076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.1888417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.1888733Z ) 2025-05-07T20:31:54.1888930Z else: 2025-05-07T20:31:54.1889142Z scale_ub_tensor = None 2025-05-07T20:31:54.1889399Z 2025-05-07T20:31:54.1889639Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.1889959Z op = silu_mul_quant 2025-05-07T20:31:54.1890225Z if compiled: 2025-05-07T20:31:54.1890575Z op = torch.compile(op) 2025-05-07T20:31:54.1891046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1891327Z 2025-05-07T20:31:54.1891530Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.1891699Z 2025-05-07T20:31:54.1891923Z moe/activation_test.py:117: 2025-05-07T20:31:54.1892232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1892576Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.1892866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1893436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.1894015Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.1894701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.1895413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.1895960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.1896675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.1897364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.1897906Z kernel = self.compile( 2025-05-07T20:31:54.1898465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.1899144Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.1899558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1899794Z 2025-05-07T20:31:54.1900004Z self = 2025-05-07T20:31:54.1901172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.1902732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44767672e0>} 2025-05-07T20:31:54.1904131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.1905182Z context = 2025-05-07T20:31:54.1905485Z 2025-05-07T20:31:54.1905653Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.1907263Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.1907754Z module_map=module_map) 2025-05-07T20:31:54.1908122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.1908493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.1908757Z E ^ 2025-05-07T20:31:54.1909235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.1909708Z 2025-05-07T20:31:54.1910139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.1910680Z 2025-05-07T20:31:54.1910784Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.1911209Z self=, 2025-05-07T20:31:54.1911619Z T=1, 2025-05-07T20:31:54.1911808Z D=5120, 2025-05-07T20:31:54.1912012Z scale_ub=1200.0, 2025-05-07T20:31:54.1912238Z contiguous=False, 2025-05-07T20:31:54.1912472Z compiled=False, 2025-05-07T20:31:54.1912687Z ) 2025-05-07T20:31:54.1913009Z self = 2025-05-07T20:31:54.1913651Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:54.1913930Z 2025-05-07T20:31:54.1914009Z @given( 2025-05-07T20:31:54.1914248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.1914568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.1914883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.1915223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.1915557Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.1915852Z ) 2025-05-07T20:31:54.1916213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.1916667Z def test_silu_mul_quant( 2025-05-07T20:31:54.1916911Z self, 2025-05-07T20:31:54.1917112Z T: int, 2025-05-07T20:31:54.1917317Z D: int, 2025-05-07T20:31:54.1917537Z scale_ub: Optional[float], 2025-05-07T20:31:54.1917813Z contiguous: bool, 2025-05-07T20:31:54.1918068Z compiled: bool, 2025-05-07T20:31:54.1918291Z ) -> None: 2025-05-07T20:31:54.1918511Z torch.manual_seed(2025) 2025-05-07T20:31:54.1918758Z 2025-05-07T20:31:54.1919033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.1919385Z 2025-05-07T20:31:54.1919586Z x_sign = torch.sign(x) 2025-05-07T20:31:54.1919879Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.1920195Z x = x_sign * x_clamp 2025-05-07T20:31:54.1920448Z x0 = x[:, :D] 2025-05-07T20:31:54.1920665Z x1 = x[:, D:] 2025-05-07T20:31:54.1920875Z 2025-05-07T20:31:54.1921066Z if contiguous: 2025-05-07T20:31:54.1921299Z x0 = x0.contiguous() 2025-05-07T20:31:54.1921564Z x1 = x1.contiguous() 2025-05-07T20:31:54.1921810Z 2025-05-07T20:31:54.1921999Z if scale_ub is not None: 2025-05-07T20:31:54.1922278Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.1922624Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.1922942Z ) 2025-05-07T20:31:54.1923136Z else: 2025-05-07T20:31:54.1923350Z scale_ub_tensor = None 2025-05-07T20:31:54.1923744Z 2025-05-07T20:31:54.1923978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.1924304Z op = silu_mul_quant 2025-05-07T20:31:54.1924561Z if compiled: 2025-05-07T20:31:54.1924806Z op = torch.compile(op) 2025-05-07T20:31:54.1925111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1925394Z 2025-05-07T20:31:54.1925587Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.1925761Z 2025-05-07T20:31:54.1925861Z moe/activation_test.py:117: 2025-05-07T20:31:54.1926168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1926508Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.1926790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.1927506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.1928217Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.1928770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.1929478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.1930165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.1930772Z kernel = self.compile( 2025-05-07T20:31:54.1931328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.1932104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.1932513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.1932839Z 2025-05-07T20:31:54.1933050Z self = 2025-05-07T20:31:54.1934176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.1935600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44762d6ca0>} 2025-05-07T20:31:54.1936989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.1938052Z context = 2025-05-07T20:31:54.1938351Z 2025-05-07T20:31:54.1938529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.1939069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.1939557Z module_map=module_map) 2025-05-07T20:31:54.1939930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.1940286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.1940554Z E ^ 2025-05-07T20:31:54.1941035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.1941500Z 2025-05-07T20:31:54.1941928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.1942467Z 2025-05-07T20:31:54.1942573Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.1943001Z self=, 2025-05-07T20:31:54.1943424Z T=16384, 2025-05-07T20:31:54.1943619Z D=5120, 2025-05-07T20:31:54.1943820Z scale_ub=1200.0, 2025-05-07T20:31:54.1944053Z contiguous=False, 2025-05-07T20:31:54.1944279Z compiled=True, 2025-05-07T20:31:54.1944491Z ) 2025-05-07T20:31:54.2815170Z self = 2025-05-07T20:31:54.2815746Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:54.2816041Z 2025-05-07T20:31:54.2816122Z @given( 2025-05-07T20:31:54.2816351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.2816666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.2816974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.2817303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.2817632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.2817919Z ) 2025-05-07T20:31:54.2818265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.2818725Z def test_silu_mul_quant( 2025-05-07T20:31:54.2818968Z self, 2025-05-07T20:31:54.2819160Z T: int, 2025-05-07T20:31:54.2819355Z D: int, 2025-05-07T20:31:54.2819580Z scale_ub: Optional[float], 2025-05-07T20:31:54.2819845Z contiguous: bool, 2025-05-07T20:31:54.2820083Z compiled: bool, 2025-05-07T20:31:54.2820317Z ) -> None: 2025-05-07T20:31:54.2820532Z torch.manual_seed(2025) 2025-05-07T20:31:54.2820778Z 2025-05-07T20:31:54.2821056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.2821408Z 2025-05-07T20:31:54.2821599Z x_sign = torch.sign(x) 2025-05-07T20:31:54.2821896Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.2822215Z x = x_sign * x_clamp 2025-05-07T20:31:54.2822455Z x0 = x[:, :D] 2025-05-07T20:31:54.2822681Z x1 = x[:, D:] 2025-05-07T20:31:54.2822891Z 2025-05-07T20:31:54.2823226Z if contiguous: 2025-05-07T20:31:54.2823463Z x0 = x0.contiguous() 2025-05-07T20:31:54.2823724Z x1 = x1.contiguous() 2025-05-07T20:31:54.2823962Z 2025-05-07T20:31:54.2824163Z if scale_ub is not None: 2025-05-07T20:31:54.2824441Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.2824781Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.2825103Z ) 2025-05-07T20:31:54.2825303Z else: 2025-05-07T20:31:54.2825515Z scale_ub_tensor = None 2025-05-07T20:31:54.2825772Z 2025-05-07T20:31:54.2826011Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.2826329Z op = silu_mul_quant 2025-05-07T20:31:54.2826591Z if compiled: 2025-05-07T20:31:54.2826844Z op = torch.compile(op) 2025-05-07T20:31:54.2827148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2827419Z 2025-05-07T20:31:54.2827613Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.2827786Z 2025-05-07T20:31:54.2827892Z moe/activation_test.py:117: 2025-05-07T20:31:54.2828188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2828534Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.2828820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2829394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.2829973Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.2830653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.2831364Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.2831912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.2832617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.2833310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.2833863Z kernel = self.compile( 2025-05-07T20:31:54.2834500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.2835185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2835597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2835837Z 2025-05-07T20:31:54.2836052Z self = 2025-05-07T20:31:54.2837177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.2838629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44764ef380>} 2025-05-07T20:31:54.2840044Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.2841111Z context = 2025-05-07T20:31:54.2841409Z 2025-05-07T20:31:54.2841581Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.2842120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2842604Z module_map=module_map) 2025-05-07T20:31:54.2842971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2843337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2843622Z E ^ 2025-05-07T20:31:54.2844226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2844695Z 2025-05-07T20:31:54.2845130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.2845672Z 2025-05-07T20:31:54.2845779Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.2846207Z self=, 2025-05-07T20:31:54.2846624Z T=2048, 2025-05-07T20:31:54.2846845Z D=7168, 2025-05-07T20:31:54.2856842Z scale_ub=1200.0, 2025-05-07T20:31:54.2857133Z contiguous=False, 2025-05-07T20:31:54.2857371Z compiled=True, 2025-05-07T20:31:54.2857578Z ) 2025-05-07T20:31:54.2857911Z self = 2025-05-07T20:31:54.2858429Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:54.2858729Z 2025-05-07T20:31:54.2858811Z @given( 2025-05-07T20:31:54.2859052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.2859371Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.2859682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.2860022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.2860361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.2860694Z ) 2025-05-07T20:31:54.2861055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.2861510Z def test_silu_mul_quant( 2025-05-07T20:31:54.2861761Z self, 2025-05-07T20:31:54.2861952Z T: int, 2025-05-07T20:31:54.2862155Z D: int, 2025-05-07T20:31:54.2862377Z scale_ub: Optional[float], 2025-05-07T20:31:54.2862648Z contiguous: bool, 2025-05-07T20:31:54.2862896Z compiled: bool, 2025-05-07T20:31:54.2863126Z ) -> None: 2025-05-07T20:31:54.2863347Z torch.manual_seed(2025) 2025-05-07T20:31:54.2863595Z 2025-05-07T20:31:54.2863875Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.2864222Z 2025-05-07T20:31:54.2864550Z x_sign = torch.sign(x) 2025-05-07T20:31:54.2864848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.2865168Z x = x_sign * x_clamp 2025-05-07T20:31:54.2865407Z x0 = x[:, :D] 2025-05-07T20:31:54.2865629Z x1 = x[:, D:] 2025-05-07T20:31:54.2865839Z 2025-05-07T20:31:54.2866023Z if contiguous: 2025-05-07T20:31:54.2866254Z x0 = x0.contiguous() 2025-05-07T20:31:54.2866513Z x1 = x1.contiguous() 2025-05-07T20:31:54.2866823Z 2025-05-07T20:31:54.2867089Z if scale_ub is not None: 2025-05-07T20:31:54.2867418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.2867750Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.2868065Z ) 2025-05-07T20:31:54.2868270Z else: 2025-05-07T20:31:54.2868478Z scale_ub_tensor = None 2025-05-07T20:31:54.2868736Z 2025-05-07T20:31:54.2868973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.2869293Z op = silu_mul_quant 2025-05-07T20:31:54.2869548Z if compiled: 2025-05-07T20:31:54.2869800Z op = torch.compile(op) 2025-05-07T20:31:54.2870102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2870375Z 2025-05-07T20:31:54.2870573Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.2870741Z 2025-05-07T20:31:54.2870850Z moe/activation_test.py:117: 2025-05-07T20:31:54.2871148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2871490Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.2871782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.2872353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.2873093Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.2873773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.2874492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.2875046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.2875753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.2876442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.2876987Z kernel = self.compile( 2025-05-07T20:31:54.2877546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.2878398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2878917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.2879161Z 2025-05-07T20:31:54.2879376Z self = 2025-05-07T20:31:54.2880511Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.2881947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44764efd80>} 2025-05-07T20:31:54.2883356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.2884420Z context = 2025-05-07T20:31:54.2884722Z 2025-05-07T20:31:54.2884893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.2885540Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2886028Z module_map=module_map) 2025-05-07T20:31:54.2886398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2886770Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2887038Z E ^ 2025-05-07T20:31:54.2887520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2887987Z 2025-05-07T20:31:54.2888418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.2888960Z 2025-05-07T20:31:54.4032411Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.4032906Z self=, 2025-05-07T20:31:54.4033484Z T=1, 2025-05-07T20:31:54.4033678Z D=5120, 2025-05-07T20:31:54.4033871Z scale_ub=None, 2025-05-07T20:31:54.4034092Z contiguous=False, 2025-05-07T20:31:54.4034332Z compiled=False, 2025-05-07T20:31:54.4034541Z ) 2025-05-07T20:31:54.4034877Z self = 2025-05-07T20:31:54.4035382Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:54.4035653Z 2025-05-07T20:31:54.4035735Z @given( 2025-05-07T20:31:54.4035976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.4036303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.4036615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.4036958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.4037295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.4037587Z ) 2025-05-07T20:31:54.4038284Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.4038740Z def test_silu_mul_quant( 2025-05-07T20:31:54.4038988Z self, 2025-05-07T20:31:54.4039182Z T: int, 2025-05-07T20:31:54.4039389Z D: int, 2025-05-07T20:31:54.4039616Z scale_ub: Optional[float], 2025-05-07T20:31:54.4039887Z contiguous: bool, 2025-05-07T20:31:54.4040133Z compiled: bool, 2025-05-07T20:31:54.4040367Z ) -> None: 2025-05-07T20:31:54.4040582Z torch.manual_seed(2025) 2025-05-07T20:31:54.4040831Z 2025-05-07T20:31:54.4041110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.4041455Z 2025-05-07T20:31:54.4041655Z x_sign = torch.sign(x) 2025-05-07T20:31:54.4041953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.4042265Z x = x_sign * x_clamp 2025-05-07T20:31:54.4042515Z x0 = x[:, :D] 2025-05-07T20:31:54.4042749Z x1 = x[:, D:] 2025-05-07T20:31:54.4042962Z 2025-05-07T20:31:54.4043148Z if contiguous: 2025-05-07T20:31:54.4043386Z x0 = x0.contiguous() 2025-05-07T20:31:54.4043650Z x1 = x1.contiguous() 2025-05-07T20:31:54.4043896Z 2025-05-07T20:31:54.4044092Z if scale_ub is not None: 2025-05-07T20:31:54.4044371Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.4044709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.4045030Z ) 2025-05-07T20:31:54.4045230Z else: 2025-05-07T20:31:54.4045443Z scale_ub_tensor = None 2025-05-07T20:31:54.4045701Z 2025-05-07T20:31:54.4045942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.4046263Z op = silu_mul_quant 2025-05-07T20:31:54.4046522Z if compiled: 2025-05-07T20:31:54.4046775Z op = torch.compile(op) 2025-05-07T20:31:54.4047073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4047363Z 2025-05-07T20:31:54.4047569Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.4047737Z 2025-05-07T20:31:54.4047846Z moe/activation_test.py:117: 2025-05-07T20:31:54.4048324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4048671Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.4048966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4049674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.4050388Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.4050943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.4051649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.4052426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.4052986Z kernel = self.compile( 2025-05-07T20:31:54.4053547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.4054227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.4054642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4054885Z 2025-05-07T20:31:54.4055096Z self = 2025-05-07T20:31:54.4056215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.4057648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476edf4c0>} 2025-05-07T20:31:54.4059138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.4060208Z context = 2025-05-07T20:31:54.4060503Z 2025-05-07T20:31:54.4060678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.4061220Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.4061697Z module_map=module_map) 2025-05-07T20:31:54.4062069Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.4062435Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.4062698Z E ^ 2025-05-07T20:31:54.4063182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.4063653Z 2025-05-07T20:31:54.4064089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.4064621Z 2025-05-07T20:31:54.4064736Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.4065159Z self=, 2025-05-07T20:31:54.4065574Z T=4096, 2025-05-07T20:31:54.4065769Z D=7168, 2025-05-07T20:31:54.4065960Z scale_ub=1200.0, 2025-05-07T20:31:54.4066193Z contiguous=False, 2025-05-07T20:31:54.4066426Z compiled=False, 2025-05-07T20:31:54.4066630Z ) 2025-05-07T20:31:54.4066960Z self = 2025-05-07T20:31:54.4067482Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:54.4067768Z 2025-05-07T20:31:54.4067852Z @given( 2025-05-07T20:31:54.4068083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.4068413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.4068729Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.4069063Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.4069485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.4069780Z ) 2025-05-07T20:31:54.4070133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.4070621Z def test_silu_mul_quant( 2025-05-07T20:31:54.4070884Z self, 2025-05-07T20:31:54.4071077Z T: int, 2025-05-07T20:31:54.4071279Z D: int, 2025-05-07T20:31:54.4071518Z scale_ub: Optional[float], 2025-05-07T20:31:54.4071796Z contiguous: bool, 2025-05-07T20:31:54.4072036Z compiled: bool, 2025-05-07T20:31:54.4072265Z ) -> None: 2025-05-07T20:31:54.4072486Z torch.manual_seed(2025) 2025-05-07T20:31:54.4072728Z 2025-05-07T20:31:54.4073006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.4073364Z 2025-05-07T20:31:54.4073563Z x_sign = torch.sign(x) 2025-05-07T20:31:54.4073856Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.4074181Z x = x_sign * x_clamp 2025-05-07T20:31:54.4074425Z x0 = x[:, :D] 2025-05-07T20:31:54.4074640Z x1 = x[:, D:] 2025-05-07T20:31:54.4074852Z 2025-05-07T20:31:54.4075042Z if contiguous: 2025-05-07T20:31:54.4075271Z x0 = x0.contiguous() 2025-05-07T20:31:54.4075535Z x1 = x1.contiguous() 2025-05-07T20:31:54.4075781Z 2025-05-07T20:31:54.4075987Z if scale_ub is not None: 2025-05-07T20:31:54.4076269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.4076610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.4076924Z ) 2025-05-07T20:31:54.4077113Z else: 2025-05-07T20:31:54.4077324Z scale_ub_tensor = None 2025-05-07T20:31:54.4077694Z 2025-05-07T20:31:54.4077927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.4078250Z op = silu_mul_quant 2025-05-07T20:31:54.4078506Z if compiled: 2025-05-07T20:31:54.4078756Z op = torch.compile(op) 2025-05-07T20:31:54.4079060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4079341Z 2025-05-07T20:31:54.4079528Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.4079699Z 2025-05-07T20:31:54.4079798Z moe/activation_test.py:117: 2025-05-07T20:31:54.4080104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4080453Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.4080777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.4081499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:54.4082211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.4082765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.4083468Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.4084162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.4084717Z kernel = self.compile( 2025-05-07T20:31:54.4085267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.4085948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.4086361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.4086599Z 2025-05-07T20:31:54.4086809Z self = 2025-05-07T20:31:54.4087925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.4089440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476b58540>} 2025-05-07T20:31:54.4090892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.4092049Z context = 2025-05-07T20:31:54.4092346Z 2025-05-07T20:31:54.4092517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.4093061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.4093546Z module_map=module_map) 2025-05-07T20:31:54.4093926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.4094281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.4094549Z E ^ 2025-05-07T20:31:54.4095035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.4095501Z 2025-05-07T20:31:54.4095933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.4096472Z 2025-05-07T20:31:54.4096576Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.4097005Z self=, 2025-05-07T20:31:54.4097428Z T=16384, 2025-05-07T20:31:54.4097620Z D=7168, 2025-05-07T20:31:54.4097820Z scale_ub=None, 2025-05-07T20:31:54.4098040Z contiguous=True, 2025-05-07T20:31:54.4098262Z compiled=True, 2025-05-07T20:31:54.4098471Z ) 2025-05-07T20:31:54.5863186Z self = 2025-05-07T20:31:54.5863992Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:54.5864385Z 2025-05-07T20:31:54.5864493Z @given( 2025-05-07T20:31:54.5864757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.5865087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.5865405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.5865739Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.5866079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.5866369Z ) 2025-05-07T20:31:54.5866724Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.5867190Z def test_silu_mul_quant( 2025-05-07T20:31:54.5867444Z self, 2025-05-07T20:31:54.5867640Z T: int, 2025-05-07T20:31:54.5867851Z D: int, 2025-05-07T20:31:54.5868072Z scale_ub: Optional[float], 2025-05-07T20:31:54.5868353Z contiguous: bool, 2025-05-07T20:31:54.5868600Z compiled: bool, 2025-05-07T20:31:54.5868834Z ) -> None: 2025-05-07T20:31:54.5869047Z torch.manual_seed(2025) 2025-05-07T20:31:54.5869310Z 2025-05-07T20:31:54.5869595Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.5869949Z 2025-05-07T20:31:54.5870143Z x_sign = torch.sign(x) 2025-05-07T20:31:54.5870450Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.5870773Z x = x_sign * x_clamp 2025-05-07T20:31:54.5871014Z x0 = x[:, :D] 2025-05-07T20:31:54.5871239Z x1 = x[:, D:] 2025-05-07T20:31:54.5871456Z 2025-05-07T20:31:54.5871640Z if contiguous: 2025-05-07T20:31:54.5871911Z x0 = x0.contiguous() 2025-05-07T20:31:54.5872265Z x1 = x1.contiguous() 2025-05-07T20:31:54.5872508Z 2025-05-07T20:31:54.5872706Z if scale_ub is not None: 2025-05-07T20:31:54.5872997Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.5873338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.5873658Z ) 2025-05-07T20:31:54.5873860Z else: 2025-05-07T20:31:54.5874444Z scale_ub_tensor = None 2025-05-07T20:31:54.5874716Z 2025-05-07T20:31:54.5874964Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.5875310Z op = silu_mul_quant 2025-05-07T20:31:54.5875569Z if compiled: 2025-05-07T20:31:54.5875826Z op = torch.compile(op) 2025-05-07T20:31:54.5876132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.5876409Z 2025-05-07T20:31:54.5876613Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.5876784Z 2025-05-07T20:31:54.5876898Z moe/activation_test.py:117: 2025-05-07T20:31:54.5877201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.5877553Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.5877854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.5878435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:54.5879035Z return fn(*args, **kwargs) 
2025-05-07T20:31:54.5879729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:54.5880451Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:54.5881011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:54.5881724Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:54.5882420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:54.5883136Z     kernel = self.compile(
2025-05-07T20:31:54.5883787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:54.5884667Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:54.5885089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:54.5885331Z 
2025-05-07T20:31:54.5885546Z self = <...>
2025-05-07T20:31:54.5886679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:54.5888138Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f4477b899e0>}
2025-05-07T20:31:54.5889544Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:54.5890620Z context = <...>
2025-05-07T20:31:54.5890918Z 
2025-05-07T20:31:54.5891095Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:54.5891646Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:54.5892238Z                            module_map=module_map)
2025-05-07T20:31:54.5892616Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:54.5892976Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:54.5893245Z E       ^
2025-05-07T20:31:54.5893732Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:54.5894202Z 
2025-05-07T20:31:54.5894633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:54.5895180Z 
2025-05-07T20:31:54.5895286Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:54.5895801Z     self=<...>,
2025-05-07T20:31:54.5896224Z     T=4096,
2025-05-07T20:31:54.5896412Z     D=5120,
2025-05-07T20:31:54.5896610Z     scale_ub=None,
2025-05-07T20:31:54.5896835Z     contiguous=False,
2025-05-07T20:31:54.5897063Z     compiled=True,
2025-05-07T20:31:54.5897279Z )
2025-05-07T20:31:54.5897613Z self = <...>
2025-05-07T20:31:54.5898128Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:31:54.5898420Z 
2025-05-07T20:31:54.5898501Z     @given(
2025-05-07T20:31:54.5898743Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:54.5899065Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:54.5899383Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:54.5899733Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:54.5900079Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:54.5900377Z     )
2025-05-07T20:31:54.5900773Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:54.5901260Z     def test_silu_mul_quant(
2025-05-07T20:31:54.5901506Z         self,
2025-05-07T20:31:54.5901716Z         T: int,
2025-05-07T20:31:54.5901927Z         D: int,
2025-05-07T20:31:54.5902146Z         scale_ub: Optional[float],
2025-05-07T20:31:54.5902430Z         contiguous: bool,
2025-05-07T20:31:54.5902680Z         compiled: bool,
2025-05-07T20:31:54.5902910Z     ) -> None:
2025-05-07T20:31:54.5903139Z         torch.manual_seed(2025)
2025-05-07T20:31:54.5903394Z 
2025-05-07T20:31:54.5903671Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:54.5904029Z 
2025-05-07T20:31:54.5904228Z         x_sign = torch.sign(x)
2025-05-07T20:31:54.5904618Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:54.5904936Z         x = x_sign * x_clamp
2025-05-07T20:31:54.5905185Z         x0 = x[:, :D]
2025-05-07T20:31:54.5905413Z         x1 = x[:, D:]
2025-05-07T20:31:54.5905625Z 
2025-05-07T20:31:54.5905835Z         if contiguous:
2025-05-07T20:31:54.5906076Z             x0 = x0.contiguous()
2025-05-07T20:31:54.5906696Z             x1 = x1.contiguous()
2025-05-07T20:31:54.5906947Z 
2025-05-07T20:31:54.5907147Z         if scale_ub is not None:
2025-05-07T20:31:54.5907431Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:54.5907772Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:54.5916728Z             )
2025-05-07T20:31:54.5916950Z         else:
2025-05-07T20:31:54.5917167Z             scale_ub_tensor = None
2025-05-07T20:31:54.5917426Z 
2025-05-07T20:31:54.5917661Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:54.5917987Z             op = silu_mul_quant
2025-05-07T20:31:54.5918269Z             if compiled:
2025-05-07T20:31:54.5918522Z                 op = torch.compile(op)
2025-05-07T20:31:54.5918832Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:54.5919122Z 
2025-05-07T20:31:54.5919333Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:54.5919505Z 
2025-05-07T20:31:54.5919611Z moe/activation_test.py:117: 
2025-05-07T20:31:54.5919928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:54.5920284Z moe/activation_test.py:115: in fn
2025-05-07T20:31:54.5920575Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:54.5921163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:54.5921749Z     return fn(*args, **kwargs)
2025-05-07T20:31:54.5922431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:54.5923157Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:54.5923724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:54.5924617Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:54.5925315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:54.5925867Z     kernel = self.compile(
2025-05-07T20:31:54.5926438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:54.5927116Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:54.5927532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:54.5927771Z 
2025-05-07T20:31:54.5927990Z self = <...>
2025-05-07T20:31:54.5929127Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:54.5930565Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f4477b5fba0>}
2025-05-07T20:31:54.5932048Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:54.5933121Z context = <...>
2025-05-07T20:31:54.5933420Z 
2025-05-07T20:31:54.5933600Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:54.5934142Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:54.5934772Z                            module_map=module_map)
2025-05-07T20:31:54.5935150Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:54.5935519Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:54.5935790Z E       ^
2025-05-07T20:31:54.5936275Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:54.5936743Z 
2025-05-07T20:31:54.5937181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:54.5937714Z 
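Every example in this run fails at the same point: the Triton kernel _fbgemm_silu_mul_quant invoked by silu_mul_quant requests the fp8e4nv (FP8 E4M3) dtype, and Triton lowers that dtype only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts such as an A10G (SM 8.6), Triton exposes only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of a capability gate a test like this could use to skip cleanly on such hardware follows; the helper and class names are illustrative, not taken from activation_test.py:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (E4M3) only on NVIDIA GPUs with
        # compute capability >= (8, 9); pre-Ada parts like the A10G report
        # (8, 6) and raise the CompilationError seen above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantFp8Test(unittest.TestCase):
        # test_silu_mul_quant from the listing above would live here.
        pass

For reference, the failure should reproduce without Hypothesis or torch.compile, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([128, 2 * D], device="cuda", dtype=torch.bfloat16)
    # On SM 8.6 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)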
[The following Hypothesis examples each fail with the identical CompilationError; per-example source listings and tracebacks elided.]
2025-05-07T20:31:54.7390966Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:54.7424023Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:54.8593287Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:54.8626687Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -- same CompilationError
2025-05-07T20:31:54.8658718Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:55.1101170Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:55.1133631Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:55.2043837Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:31:55.3661203Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:31:55.3701780Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- same CompilationError
2025-05-07T20:31:55.5416890Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:55.5417363Z     self=<...>,
2025-05-07T20:31:55.5417957Z     T=16384,
2025-05-07T20:31:55.5418162Z     D=5120,
2025-05-07T20:31:55.5418368Z     scale_ub=None,
2025-05-07T20:31:55.5418595Z     contiguous=False,
2025-05-07T20:31:55.5418831Z     compiled=True,
2025-05-07T20:31:55.5419047Z )
2025-05-07T20:31:55.5419377Z self = <...>
2025-05-07T20:31:55.5419903Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True
[test source identical to the listing above; elided]
2025-05-07T20:31:55.5432255Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:55.5432423Z 
2025-05-07T20:31:55.5432536Z moe/activation_test.py:117: 
2025-05-07T20:31:55.5432844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:55.5433182Z moe/activation_test.py:115: in fn
2025-05-07T20:31:55.5433474Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:55.5434054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:55.5434631Z     return fn(*args, **kwargs)
2025-05-07T20:31:55.5435317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.5436042Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.5436602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.5437313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.5438010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.5438570Z kernel = self.compile( 2025-05-07T20:31:55.5439131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.5439817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.5440234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.5440473Z 2025-05-07T20:31:55.5440691Z self = 2025-05-07T20:31:55.5441868Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.5443421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608f920>} 2025-05-07T20:31:55.5444823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.5445893Z context = 2025-05-07T20:31:55.5446189Z 2025-05-07T20:31:55.5446367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.5446910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.5447398Z module_map=module_map) 2025-05-07T20:31:55.5447780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.5448139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.5448410Z E ^ 2025-05-07T20:31:55.5448896Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.5449365Z 2025-05-07T20:31:55.5449801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.5450334Z 2025-05-07T20:31:55.5450439Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.5450869Z self=, 2025-05-07T20:31:55.5451287Z T=2048, 2025-05-07T20:31:55.5451473Z D=5120, 2025-05-07T20:31:55.5451669Z scale_ub=None, 2025-05-07T20:31:55.5451965Z contiguous=False, 2025-05-07T20:31:55.5452191Z compiled=True, 2025-05-07T20:31:55.5452400Z ) 2025-05-07T20:31:55.8370764Z self = 2025-05-07T20:31:55.8372460Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:55.8372753Z 2025-05-07T20:31:55.8372831Z @given( 2025-05-07T20:31:55.8373078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.8373393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.8373706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.8374050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.8374380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.8374671Z ) 2025-05-07T20:31:55.8375034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.8375488Z def test_silu_mul_quant( 2025-05-07T20:31:55.8375744Z self, 2025-05-07T20:31:55.8375949Z T: int, 2025-05-07T20:31:55.8376148Z D: int, 2025-05-07T20:31:55.8376373Z scale_ub: Optional[float], 2025-05-07T20:31:55.8376664Z contiguous: bool, 2025-05-07T20:31:55.8376913Z compiled: bool, 2025-05-07T20:31:55.8377140Z ) -> None: 2025-05-07T20:31:55.8377367Z torch.manual_seed(2025) 2025-05-07T20:31:55.8377623Z 2025-05-07T20:31:55.8377904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.8378260Z 2025-05-07T20:31:55.8378463Z x_sign = torch.sign(x) 2025-05-07T20:31:55.8378756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.8379081Z x = x_sign * x_clamp 2025-05-07T20:31:55.8379329Z x0 = x[:, :D] 2025-05-07T20:31:55.8379546Z x1 = x[:, D:] 2025-05-07T20:31:55.8379756Z 2025-05-07T20:31:55.8379946Z if contiguous: 2025-05-07T20:31:55.8380177Z x0 = x0.contiguous() 2025-05-07T20:31:55.8380453Z x1 = x1.contiguous() 2025-05-07T20:31:55.8380699Z 2025-05-07T20:31:55.8380890Z if scale_ub is not None: 2025-05-07T20:31:55.8381168Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.8381522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.8381844Z ) 2025-05-07T20:31:55.8382038Z else: 2025-05-07T20:31:55.8382388Z scale_ub_tensor = None 2025-05-07T20:31:55.8382653Z 2025-05-07T20:31:55.8382887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.8383216Z op = silu_mul_quant 2025-05-07T20:31:55.8383473Z if compiled: 2025-05-07T20:31:55.8383719Z op = torch.compile(op) 2025-05-07T20:31:55.8384022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8384304Z 2025-05-07T20:31:55.8384496Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.8384670Z 2025-05-07T20:31:55.8384773Z moe/activation_test.py:117: 2025-05-07T20:31:55.8385082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8385412Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.8385702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8386277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:55.8386851Z return fn(*args, **kwargs) 
2025-05-07T20:31:55.8387527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.8388232Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.8388784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.8389484Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.8390161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.8390707Z kernel = self.compile( 2025-05-07T20:31:55.8391260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.8392019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.8392426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8392674Z 2025-05-07T20:31:55.8392886Z self = 2025-05-07T20:31:55.8394001Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.8395420Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608df80>} 2025-05-07T20:31:55.8396815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.8397880Z context = 2025-05-07T20:31:55.8398174Z 2025-05-07T20:31:55.8398356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.8398896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.8399370Z module_map=module_map) 2025-05-07T20:31:55.8399743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.8400108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.8400363Z E ^ 2025-05-07T20:31:55.8400839Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.8401303Z 2025-05-07T20:31:55.8401740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.8402275Z 2025-05-07T20:31:55.8402385Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.8402805Z self=, 2025-05-07T20:31:55.8403302Z T=2048, 2025-05-07T20:31:55.8403496Z D=5120, 2025-05-07T20:31:55.8403685Z scale_ub=1200.0, 2025-05-07T20:31:55.8403914Z contiguous=False, 2025-05-07T20:31:55.8404141Z compiled=True, 2025-05-07T20:31:55.8404343Z ) 2025-05-07T20:31:55.8404669Z self = 2025-05-07T20:31:55.8405180Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:55.8405461Z 2025-05-07T20:31:55.8405545Z @given( 2025-05-07T20:31:55.8405772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.8406091Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.8406659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.8406999Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.8407335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.8407624Z ) 2025-05-07T20:31:55.8407984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.8408437Z def test_silu_mul_quant( 2025-05-07T20:31:55.8408683Z self, 2025-05-07T20:31:55.8408874Z T: int, 2025-05-07T20:31:55.8409073Z D: int, 2025-05-07T20:31:55.8409295Z scale_ub: Optional[float], 2025-05-07T20:31:55.8409564Z contiguous: bool, 2025-05-07T20:31:55.8409806Z compiled: bool, 2025-05-07T20:31:55.8410033Z ) -> None: 2025-05-07T20:31:55.8410246Z torch.manual_seed(2025) 2025-05-07T20:31:55.8410491Z 2025-05-07T20:31:55.8410768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.8411118Z 2025-05-07T20:31:55.8411310Z x_sign = torch.sign(x) 2025-05-07T20:31:55.8411602Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.8412110Z x = x_sign * x_clamp 2025-05-07T20:31:55.8412348Z x0 = x[:, :D] 2025-05-07T20:31:55.8412567Z x1 = x[:, D:] 2025-05-07T20:31:55.8412778Z 2025-05-07T20:31:55.8412961Z if contiguous: 2025-05-07T20:31:55.8413194Z x0 = x0.contiguous() 2025-05-07T20:31:55.8413453Z x1 = x1.contiguous() 2025-05-07T20:31:55.8413687Z 2025-05-07T20:31:55.8413880Z if scale_ub is not None: 2025-05-07T20:31:55.8414156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.8414489Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.8414803Z ) 2025-05-07T20:31:55.8414995Z else: 2025-05-07T20:31:55.8415202Z scale_ub_tensor = None 2025-05-07T20:31:55.8415457Z 2025-05-07T20:31:55.8415691Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.8416010Z op = silu_mul_quant 2025-05-07T20:31:55.8416265Z if compiled: 2025-05-07T20:31:55.8416518Z op = torch.compile(op) 2025-05-07T20:31:55.8416817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8417089Z 2025-05-07T20:31:55.8417287Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.8417454Z 2025-05-07T20:31:55.8417560Z moe/activation_test.py:117: 2025-05-07T20:31:55.8417858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8418196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.8418485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.8419055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:55.8419629Z return fn(*args, **kwargs) 
2025-05-07T20:31:55.8420306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.8421011Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.8421561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.8422382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.8423072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.8423620Z kernel = self.compile( 2025-05-07T20:31:55.8424170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.8424845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.8425251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.8425486Z 2025-05-07T20:31:55.8425699Z self = 2025-05-07T20:31:55.8426815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.8428249Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fdca40>} 2025-05-07T20:31:55.8429637Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.8430696Z context = 2025-05-07T20:31:55.8431014Z 2025-05-07T20:31:55.8431209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.8431745Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.8432333Z module_map=module_map) 2025-05-07T20:31:55.8432706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.8433062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.8433328Z E ^ 2025-05-07T20:31:55.8433812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.8434278Z 2025-05-07T20:31:55.8434709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.8435243Z 2025-05-07T20:31:56.0157367Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.0157999Z self=, 2025-05-07T20:31:56.0158611Z T=4096, 2025-05-07T20:31:56.0158863Z D=5120, 2025-05-07T20:31:56.0159119Z scale_ub=1200.0, 2025-05-07T20:31:56.0159348Z contiguous=True, 2025-05-07T20:31:56.0159571Z compiled=True, 2025-05-07T20:31:56.0159804Z ) 2025-05-07T20:31:56.0160147Z self = 2025-05-07T20:31:56.0160659Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:56.0160955Z 2025-05-07T20:31:56.0161047Z @given( 2025-05-07T20:31:56.0161330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.0161664Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.0161981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.0162330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.0162673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.0162959Z ) 2025-05-07T20:31:56.0163328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.0163784Z def test_silu_mul_quant( 2025-05-07T20:31:56.0164041Z self, 2025-05-07T20:31:56.0164246Z T: int, 2025-05-07T20:31:56.0164444Z D: int, 2025-05-07T20:31:56.0164676Z scale_ub: Optional[float], 2025-05-07T20:31:56.0164962Z contiguous: bool, 2025-05-07T20:31:56.0165205Z compiled: bool, 2025-05-07T20:31:56.0165440Z ) -> None: 2025-05-07T20:31:56.0165835Z torch.manual_seed(2025) 2025-05-07T20:31:56.0166091Z 2025-05-07T20:31:56.0166370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.0166725Z 2025-05-07T20:31:56.0166927Z x_sign = torch.sign(x) 2025-05-07T20:31:56.0167226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.0167550Z x = x_sign * x_clamp 2025-05-07T20:31:56.0167804Z x0 = x[:, :D] 2025-05-07T20:31:56.0168026Z x1 = x[:, D:] 2025-05-07T20:31:56.0168244Z 2025-05-07T20:31:56.0168442Z if contiguous: 2025-05-07T20:31:56.0168681Z x0 = x0.contiguous() 2025-05-07T20:31:56.0168957Z x1 = x1.contiguous() 2025-05-07T20:31:56.0169215Z 2025-05-07T20:31:56.0169414Z if scale_ub is not None: 2025-05-07T20:31:56.0169701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.0170048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.0170357Z ) 2025-05-07T20:31:56.0170569Z else: 2025-05-07T20:31:56.0170789Z scale_ub_tensor = None 2025-05-07T20:31:56.0171050Z 2025-05-07T20:31:56.0171289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.0171623Z op = silu_mul_quant 2025-05-07T20:31:56.0171983Z if compiled: 2025-05-07T20:31:56.0172229Z op = torch.compile(op) 2025-05-07T20:31:56.0172530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.0172809Z 2025-05-07T20:31:56.0173000Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.0173170Z 2025-05-07T20:31:56.0173272Z moe/activation_test.py:117: 2025-05-07T20:31:56.0173576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.0174051Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.0174339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.0174921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.0175500Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.0176174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.0176885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.0177437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.0178133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.0178818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.0179367Z kernel = self.compile( 2025-05-07T20:31:56.0179933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.0180603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.0181056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.0181308Z 2025-05-07T20:31:56.0181531Z self = 2025-05-07T20:31:56.0182650Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.0184070Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fde2a0>} 2025-05-07T20:31:56.0185466Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.0186611Z context = 2025-05-07T20:31:56.0186909Z 2025-05-07T20:31:56.0187090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.0187626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.0188108Z module_map=module_map) 2025-05-07T20:31:56.0188485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.0188852Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.0189114Z E ^ 2025-05-07T20:31:56.0189599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.0190063Z 2025-05-07T20:31:56.0190500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.0191052Z 2025-05-07T20:31:56.0191177Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.0191629Z self=, 2025-05-07T20:31:56.0192047Z T=128, 2025-05-07T20:31:56.0192240Z D=5120, 2025-05-07T20:31:56.0192437Z scale_ub=1200.0, 2025-05-07T20:31:56.0192667Z contiguous=False, 2025-05-07T20:31:56.0192900Z compiled=True, 2025-05-07T20:31:56.0193104Z ) 2025-05-07T20:31:56.1197621Z self = 2025-05-07T20:31:56.1199143Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:56.1199917Z 2025-05-07T20:31:56.1200133Z @given( 2025-05-07T20:31:56.1200650Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1201204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1201767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1202109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1202449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1202741Z ) 2025-05-07T20:31:56.1203115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1203575Z def test_silu_mul_quant( 2025-05-07T20:31:56.1203823Z self, 2025-05-07T20:31:56.1204031Z T: int, 2025-05-07T20:31:56.1204245Z D: int, 2025-05-07T20:31:56.1204468Z scale_ub: Optional[float], 2025-05-07T20:31:56.1204754Z contiguous: bool, 2025-05-07T20:31:56.1205006Z compiled: bool, 2025-05-07T20:31:56.1205234Z ) -> None: 2025-05-07T20:31:56.1205458Z torch.manual_seed(2025) 2025-05-07T20:31:56.1205707Z 2025-05-07T20:31:56.1205981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1206513Z 2025-05-07T20:31:56.1206723Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1207016Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1207339Z x = x_sign * x_clamp 2025-05-07T20:31:56.1207588Z x0 = x[:, :D] 2025-05-07T20:31:56.1207818Z x1 = x[:, D:] 2025-05-07T20:31:56.1208023Z 2025-05-07T20:31:56.1208216Z if contiguous: 2025-05-07T20:31:56.1208454Z x0 = x0.contiguous() 2025-05-07T20:31:56.1208714Z x1 = x1.contiguous() 2025-05-07T20:31:56.1208958Z 2025-05-07T20:31:56.1209159Z if scale_ub is not None: 2025-05-07T20:31:56.1209434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1209775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1210092Z ) 2025-05-07T20:31:56.1210283Z else: 2025-05-07T20:31:56.1210502Z scale_ub_tensor = None 2025-05-07T20:31:56.1210758Z 2025-05-07T20:31:56.1210991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1211319Z op = silu_mul_quant 2025-05-07T20:31:56.1211583Z if compiled: 2025-05-07T20:31:56.1211904Z op = torch.compile(op) 2025-05-07T20:31:56.1212209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1212621Z 2025-05-07T20:31:56.1212824Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.1212991Z 2025-05-07T20:31:56.1213098Z moe/activation_test.py:117: 2025-05-07T20:31:56.1213403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1213752Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.1214034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1214615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.1215193Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.1215871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.1216590Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.1217138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1217843Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1218520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.1219066Z kernel = self.compile( 2025-05-07T20:31:56.1219619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1220296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1220697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1220938Z 2025-05-07T20:31:56.1221173Z self = 2025-05-07T20:31:56.1222474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1223892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44779540e0>} 2025-05-07T20:31:56.1225272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1226329Z context = 2025-05-07T20:31:56.1226629Z 2025-05-07T20:31:56.1226799Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1227335Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1227816Z module_map=module_map) 2025-05-07T20:31:56.1228187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1228548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.1228809Z E ^ 2025-05-07T20:31:56.1229281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1229755Z 2025-05-07T20:31:56.1230183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1230712Z 2025-05-07T20:31:56.1230822Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.1231241Z self=, 2025-05-07T20:31:56.1231655Z T=16384, 2025-05-07T20:31:56.1231856Z D=7168, 2025-05-07T20:31:56.1232053Z scale_ub=1200.0, 2025-05-07T20:31:56.1232274Z contiguous=True, 2025-05-07T20:31:56.1232503Z compiled=True, 2025-05-07T20:31:56.1232708Z ) 2025-05-07T20:31:56.1233034Z self = 2025-05-07T20:31:56.1233629Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:56.1233917Z 2025-05-07T20:31:56.1234000Z @given( 2025-05-07T20:31:56.1234231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1234553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1234863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1235193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1235528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1235815Z ) 2025-05-07T20:31:56.1236176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1236623Z def test_silu_mul_quant( 2025-05-07T20:31:56.1236870Z self, 2025-05-07T20:31:56.1237074Z T: int, 2025-05-07T20:31:56.1237277Z D: int, 2025-05-07T20:31:56.1237495Z scale_ub: Optional[float], 2025-05-07T20:31:56.1237770Z contiguous: bool, 2025-05-07T20:31:56.1238015Z compiled: bool, 2025-05-07T20:31:56.1238248Z ) -> None: 2025-05-07T20:31:56.1238469Z torch.manual_seed(2025) 2025-05-07T20:31:56.1238721Z 2025-05-07T20:31:56.1238994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1239345Z 2025-05-07T20:31:56.1239543Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1239833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1240148Z x = x_sign * x_clamp 2025-05-07T20:31:56.1240391Z x0 = x[:, :D] 2025-05-07T20:31:56.1240608Z x1 = x[:, D:] 2025-05-07T20:31:56.1240817Z 2025-05-07T20:31:56.1241004Z if contiguous: 2025-05-07T20:31:56.1241232Z x0 = x0.contiguous() 2025-05-07T20:31:56.1241494Z x1 = x1.contiguous() 2025-05-07T20:31:56.1241852Z 2025-05-07T20:31:56.1242042Z if scale_ub is not None: 2025-05-07T20:31:56.1242316Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1242661Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1242976Z ) 2025-05-07T20:31:56.1243168Z else: 2025-05-07T20:31:56.1243380Z scale_ub_tensor = None 2025-05-07T20:31:56.1243634Z 2025-05-07T20:31:56.1243861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1244183Z op = silu_mul_quant 2025-05-07T20:31:56.1244439Z if compiled: 2025-05-07T20:31:56.1244686Z op = torch.compile(op) 2025-05-07T20:31:56.1244986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1245267Z 2025-05-07T20:31:56.1245456Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.1245626Z 2025-05-07T20:31:56.1245726Z moe/activation_test.py:117: 2025-05-07T20:31:56.1246028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1246372Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.1246654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1247232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.1247811Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.1248482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.1249189Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.1249742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1250446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1251130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.1251680Z kernel = self.compile( 2025-05-07T20:31:56.1252312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1253065Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1253474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1253716Z 2025-05-07T20:31:56.1253927Z self = 2025-05-07T20:31:56.1255036Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1256452Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477956160>} 2025-05-07T20:31:56.1257845Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1258910Z context = 2025-05-07T20:31:56.1259206Z 2025-05-07T20:31:56.1259380Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1259914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1260388Z module_map=module_map) 2025-05-07T20:31:56.1260760Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1261122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.1261381Z E ^ 2025-05-07T20:31:56.1261861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1262410Z 2025-05-07T20:31:56.1262849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1263377Z 2025-05-07T20:31:56.2464748Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.2465441Z self=, 2025-05-07T20:31:56.2466017Z T=16384, 2025-05-07T20:31:56.2466288Z D=5120, 2025-05-07T20:31:56.2466557Z scale_ub=1200.0, 2025-05-07T20:31:56.2466802Z contiguous=True, 2025-05-07T20:31:56.2467024Z compiled=False, 2025-05-07T20:31:56.2467234Z ) 2025-05-07T20:31:56.2467555Z self = 2025-05-07T20:31:56.2468069Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:56.2468365Z 2025-05-07T20:31:56.2468444Z @given( 2025-05-07T20:31:56.2468683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.2469010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.2469327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.2469669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.2470004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.2470295Z ) 2025-05-07T20:31:56.2470656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.2471114Z def test_silu_mul_quant( 2025-05-07T20:31:56.2471401Z self, 2025-05-07T20:31:56.2471602Z T: int, 2025-05-07T20:31:56.2471800Z D: int, 2025-05-07T20:31:56.2472024Z scale_ub: Optional[float], 2025-05-07T20:31:56.2472308Z contiguous: bool, 2025-05-07T20:31:56.2472556Z compiled: bool, 2025-05-07T20:31:56.2472784Z ) -> None: 2025-05-07T20:31:56.2473016Z torch.manual_seed(2025) 2025-05-07T20:31:56.2473272Z 2025-05-07T20:31:56.2473540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.2473898Z 2025-05-07T20:31:56.2474098Z x_sign = torch.sign(x) 2025-05-07T20:31:56.2474392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.2474711Z x = x_sign * x_clamp 2025-05-07T20:31:56.2475136Z x0 = x[:, :D] 2025-05-07T20:31:56.2475360Z x1 = x[:, D:] 2025-05-07T20:31:56.2475569Z 2025-05-07T20:31:56.2475762Z if contiguous: 2025-05-07T20:31:56.2475994Z x0 = x0.contiguous() 2025-05-07T20:31:56.2476263Z x1 = x1.contiguous() 2025-05-07T20:31:56.2476510Z 2025-05-07T20:31:56.2476702Z if scale_ub is not None: 2025-05-07T20:31:56.2476981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.2477324Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.2477639Z ) 2025-05-07T20:31:56.2477831Z else: 2025-05-07T20:31:56.2478049Z scale_ub_tensor = None 2025-05-07T20:31:56.2478313Z 2025-05-07T20:31:56.2478547Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.2478871Z op = silu_mul_quant 2025-05-07T20:31:56.2479129Z if compiled: 2025-05-07T20:31:56.2479373Z op = torch.compile(op) 2025-05-07T20:31:56.2479677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2479959Z 2025-05-07T20:31:56.2480151Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.2480323Z 2025-05-07T20:31:56.2480430Z moe/activation_test.py:117: 2025-05-07T20:31:56.2480734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2481066Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.2481367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2482118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:56.2482826Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.2483371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.2484208Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.2484897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.2485444Z kernel = self.compile( 2025-05-07T20:31:56.2485992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.2486671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2487078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2487314Z 2025-05-07T20:31:56.2487528Z self = 2025-05-07T20:31:56.2488652Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.2490103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477847d80>} 2025-05-07T20:31:56.2491495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.2492620Z context = 2025-05-07T20:31:56.2492916Z 2025-05-07T20:31:56.2493096Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.2493630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2494106Z module_map=module_map) 2025-05-07T20:31:56.2494489Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2494849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2495109Z E ^ 2025-05-07T20:31:56.2495678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.2496145Z 2025-05-07T20:31:56.2496584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.2497114Z 2025-05-07T20:31:56.2497217Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.2497642Z self=, 2025-05-07T20:31:56.2498061Z T=1, 2025-05-07T20:31:56.2498250Z D=7168, 2025-05-07T20:31:56.2498446Z scale_ub=1200.0, 2025-05-07T20:31:56.2498674Z contiguous=False, 2025-05-07T20:31:56.2498901Z compiled=False, 2025-05-07T20:31:56.2499105Z ) 2025-05-07T20:31:56.2499429Z self = 2025-05-07T20:31:56.2499938Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:56.2500211Z 2025-05-07T20:31:56.2500294Z @given( 2025-05-07T20:31:56.2500533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.2500851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.2501156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.2501492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.2501828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.2502120Z ) 2025-05-07T20:31:56.2502469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.2502920Z def test_silu_mul_quant( 2025-05-07T20:31:56.2503165Z self, 2025-05-07T20:31:56.2503356Z T: int, 2025-05-07T20:31:56.2503554Z D: int, 2025-05-07T20:31:56.2503776Z scale_ub: Optional[float], 2025-05-07T20:31:56.2504133Z contiguous: bool, 2025-05-07T20:31:56.2504374Z compiled: bool, 2025-05-07T20:31:56.2504601Z ) -> None: 2025-05-07T20:31:56.2504817Z torch.manual_seed(2025) 2025-05-07T20:31:56.2505059Z 2025-05-07T20:31:56.2505338Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.2505682Z 2025-05-07T20:31:56.2505877Z x_sign = torch.sign(x) 2025-05-07T20:31:56.2506349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.2506664Z x = x_sign * x_clamp 2025-05-07T20:31:56.2506905Z x0 = x[:, :D] 2025-05-07T20:31:56.2507123Z x1 = x[:, D:] 2025-05-07T20:31:56.2507329Z 2025-05-07T20:31:56.2507516Z if contiguous: 2025-05-07T20:31:56.2507755Z x0 = x0.contiguous() 2025-05-07T20:31:56.2508017Z x1 = x1.contiguous() 2025-05-07T20:31:56.2508253Z 2025-05-07T20:31:56.2508444Z if scale_ub is not None: 2025-05-07T20:31:56.2508725Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.2515457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.2515812Z ) 2025-05-07T20:31:56.2516012Z else: 2025-05-07T20:31:56.2516228Z scale_ub_tensor = None 2025-05-07T20:31:56.2516484Z 2025-05-07T20:31:56.2516727Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.2517048Z op = silu_mul_quant 2025-05-07T20:31:56.2517298Z if compiled: 2025-05-07T20:31:56.2517549Z op = torch.compile(op) 2025-05-07T20:31:56.2517848Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2518125Z 2025-05-07T20:31:56.2518315Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.2518482Z 2025-05-07T20:31:56.2518589Z moe/activation_test.py:117: 2025-05-07T20:31:56.2518884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2519217Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.2519505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.2520214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.2521084Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.2521641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.2522339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.2523026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.2523576Z kernel = self.compile( 2025-05-07T20:31:56.2524127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.2524797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2525201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.2525440Z 2025-05-07T20:31:56.2525655Z self = 2025-05-07T20:31:56.2526770Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.2528189Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477e7fce0>} 2025-05-07T20:31:56.2529580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.2530636Z context = 2025-05-07T20:31:56.2531073Z 2025-05-07T20:31:56.2531274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.2531881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2532366Z module_map=module_map) 2025-05-07T20:31:56.2532737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2533093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2533357Z E ^ 2025-05-07T20:31:56.2533837Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.2534305Z 2025-05-07T20:31:56.2534738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.2535266Z 2025-05-07T20:31:56.4251682Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.4252409Z self=, 2025-05-07T20:31:56.4252992Z T=4096, 2025-05-07T20:31:56.4253254Z D=7168, 2025-05-07T20:31:56.4253513Z scale_ub=1200.0, 2025-05-07T20:31:56.4253796Z contiguous=False, 2025-05-07T20:31:56.4254096Z compiled=True, 2025-05-07T20:31:56.4254367Z ) 2025-05-07T20:31:56.4254727Z self = 2025-05-07T20:31:56.4255243Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:56.4255527Z 2025-05-07T20:31:56.4255616Z @given( 2025-05-07T20:31:56.4255846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.4256166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.4256480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.4256808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.4257138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.4257434Z ) 2025-05-07T20:31:56.4257796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.4258243Z def test_silu_mul_quant( 2025-05-07T20:31:56.4258490Z self, 2025-05-07T20:31:56.4258691Z T: int, 2025-05-07T20:31:56.4259054Z D: int, 2025-05-07T20:31:56.4259280Z scale_ub: Optional[float], 2025-05-07T20:31:56.4259554Z contiguous: bool, 2025-05-07T20:31:56.4259796Z compiled: bool, 2025-05-07T20:31:56.4260029Z ) -> None: 2025-05-07T20:31:56.4260245Z torch.manual_seed(2025) 2025-05-07T20:31:56.4260486Z 2025-05-07T20:31:56.4260767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.4261114Z 2025-05-07T20:31:56.4261304Z x_sign = torch.sign(x) 2025-05-07T20:31:56.4261605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.4261928Z x = x_sign * x_clamp 2025-05-07T20:31:56.4262167Z x0 = x[:, :D] 2025-05-07T20:31:56.4262386Z x1 = x[:, D:] 2025-05-07T20:31:56.4262601Z 2025-05-07T20:31:56.4262793Z if contiguous: 2025-05-07T20:31:56.4263023Z x0 = x0.contiguous() 2025-05-07T20:31:56.4263284Z x1 = x1.contiguous() 2025-05-07T20:31:56.4263540Z 2025-05-07T20:31:56.4263730Z if scale_ub is not None: 2025-05-07T20:31:56.4264008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.4264348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.4264656Z ) 2025-05-07T20:31:56.4264850Z else: 2025-05-07T20:31:56.4265060Z scale_ub_tensor = None 2025-05-07T20:31:56.4265309Z 2025-05-07T20:31:56.4265548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.4265872Z op = silu_mul_quant 2025-05-07T20:31:56.4266121Z if compiled: 2025-05-07T20:31:56.4266372Z op = torch.compile(op) 2025-05-07T20:31:56.4266668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.4267061Z 2025-05-07T20:31:56.4267255Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.4267421Z 2025-05-07T20:31:56.4267522Z moe/activation_test.py:117: 2025-05-07T20:31:56.4267824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.4268157Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.4268436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.4269005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.4269573Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.4270244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.4270945Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.4271490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.4272181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.4272869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.4273418Z kernel = self.compile( 2025-05-07T20:31:56.4273966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.4274639Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.4275045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.4275279Z 2025-05-07T20:31:56.4275497Z self = 2025-05-07T20:31:56.4276609Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.4278034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44c6ff5f80>} 2025-05-07T20:31:56.4279515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.4280573Z context = 2025-05-07T20:31:56.4280865Z 2025-05-07T20:31:56.4281038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.4281594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.4282095Z module_map=module_map) 2025-05-07T20:31:56.4282467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.4282824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.4283093Z E ^ 2025-05-07T20:31:56.4283566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.4284029Z 2025-05-07T20:31:56.4284477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.4285005Z 2025-05-07T20:31:56.4285110Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.4285531Z self=, 2025-05-07T20:31:56.4285941Z T=128, 2025-05-07T20:31:56.4286131Z D=7168, 2025-05-07T20:31:56.4286320Z scale_ub=1200.0, 2025-05-07T20:31:56.4286548Z contiguous=False, 2025-05-07T20:31:56.4286773Z compiled=True, 2025-05-07T20:31:56.4286972Z ) 2025-05-07T20:31:56.5188193Z self = 2025-05-07T20:31:56.5188917Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:56.5189522Z 2025-05-07T20:31:56.5189628Z @given( 2025-05-07T20:31:56.5189874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.5190191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.5190520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.5190861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.5191225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.5191541Z ) 2025-05-07T20:31:56.5191903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.5192361Z def test_silu_mul_quant( 2025-05-07T20:31:56.5192610Z self, 2025-05-07T20:31:56.5192814Z T: int, 2025-05-07T20:31:56.5193020Z D: int, 2025-05-07T20:31:56.5193244Z scale_ub: Optional[float], 2025-05-07T20:31:56.5193530Z contiguous: bool, 2025-05-07T20:31:56.5193783Z compiled: bool, 2025-05-07T20:31:56.5194017Z ) -> None: 2025-05-07T20:31:56.5194242Z torch.manual_seed(2025) 2025-05-07T20:31:56.5194492Z 2025-05-07T20:31:56.5194766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.5195122Z 2025-05-07T20:31:56.5195329Z x_sign = torch.sign(x) 2025-05-07T20:31:56.5195623Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.5195943Z x = x_sign * x_clamp 2025-05-07T20:31:56.5196191Z x0 = x[:, :D] 2025-05-07T20:31:56.5196415Z x1 = x[:, D:] 2025-05-07T20:31:56.5196628Z 2025-05-07T20:31:56.5196824Z if contiguous: 2025-05-07T20:31:56.5197066Z x0 = x0.contiguous() 2025-05-07T20:31:56.5197331Z x1 = x1.contiguous() 2025-05-07T20:31:56.5197582Z 2025-05-07T20:31:56.5197778Z if scale_ub is not None: 2025-05-07T20:31:56.5198055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.5198404Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.5198730Z ) 2025-05-07T20:31:56.5198926Z else: 2025-05-07T20:31:56.5199149Z scale_ub_tensor = None 2025-05-07T20:31:56.5199411Z 2025-05-07T20:31:56.5199780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.5200109Z op = silu_mul_quant 2025-05-07T20:31:56.5200366Z if compiled: 2025-05-07T20:31:56.5200620Z op = torch.compile(op) 2025-05-07T20:31:56.5200931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5201211Z 2025-05-07T20:31:56.5201415Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.5201583Z 2025-05-07T20:31:56.5201689Z moe/activation_test.py:117: 2025-05-07T20:31:56.5201993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.5202338Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.5202630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5203212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.5203794Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.5204488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.5205201Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.5205751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.5206632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.5207317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.5207868Z kernel = self.compile( 2025-05-07T20:31:56.5208424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.5209103Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.5209676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.5209915Z 2025-05-07T20:31:56.5210138Z self = 2025-05-07T20:31:56.5211259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.5212755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448572d9e0>} 2025-05-07T20:31:56.5214176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.5215274Z context = 2025-05-07T20:31:56.5215577Z 2025-05-07T20:31:56.5215757Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.5216311Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.5216803Z module_map=module_map) 2025-05-07T20:31:56.5217187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.5217551Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.5217818Z E ^ 2025-05-07T20:31:56.5218306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.5218778Z 2025-05-07T20:31:56.5219218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.5219748Z 2025-05-07T20:31:56.5219855Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.5220291Z self=, 2025-05-07T20:31:56.5220705Z T=2048, 2025-05-07T20:31:56.5220897Z D=7168, 2025-05-07T20:31:56.5221099Z scale_ub=None, 2025-05-07T20:31:56.5221446Z contiguous=True, 2025-05-07T20:31:56.5221679Z compiled=True, 2025-05-07T20:31:56.5221890Z ) 2025-05-07T20:31:56.5222224Z self = 2025-05-07T20:31:56.5222730Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:56.5223017Z 2025-05-07T20:31:56.5223099Z @given( 2025-05-07T20:31:56.5223337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.5223661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.5223970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.5224308Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.5224650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.5224942Z ) 2025-05-07T20:31:56.5225301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.5225759Z def test_silu_mul_quant( 2025-05-07T20:31:56.5226012Z self, 2025-05-07T20:31:56.5226217Z T: int, 2025-05-07T20:31:56.5226420Z D: int, 2025-05-07T20:31:56.5226642Z scale_ub: Optional[float], 2025-05-07T20:31:56.5226920Z contiguous: bool, 2025-05-07T20:31:56.5227169Z compiled: bool, 2025-05-07T20:31:56.5227396Z ) -> None: 2025-05-07T20:31:56.5227614Z torch.manual_seed(2025) 2025-05-07T20:31:56.5227860Z 2025-05-07T20:31:56.5228142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.5228485Z 2025-05-07T20:31:56.5228684Z x_sign = torch.sign(x) 2025-05-07T20:31:56.5228980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.5229294Z x = x_sign * x_clamp 2025-05-07T20:31:56.5229628Z x0 = x[:, :D] 2025-05-07T20:31:56.5229851Z x1 = x[:, D:] 2025-05-07T20:31:56.5230060Z 2025-05-07T20:31:56.5230255Z if contiguous: 2025-05-07T20:31:56.5230495Z x0 = x0.contiguous() 2025-05-07T20:31:56.5230764Z x1 = x1.contiguous() 2025-05-07T20:31:56.5231011Z 2025-05-07T20:31:56.5231209Z if scale_ub is not None: 2025-05-07T20:31:56.5231483Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.5231825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.5232141Z ) 2025-05-07T20:31:56.5232336Z else: 2025-05-07T20:31:56.5232550Z scale_ub_tensor = None 2025-05-07T20:31:56.5232808Z 2025-05-07T20:31:56.5233046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.5233366Z op = silu_mul_quant 2025-05-07T20:31:56.5233621Z if compiled: 2025-05-07T20:31:56.5233876Z op = torch.compile(op) 2025-05-07T20:31:56.5234183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5234463Z 2025-05-07T20:31:56.5234660Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.5234828Z 2025-05-07T20:31:56.5234929Z moe/activation_test.py:117: 2025-05-07T20:31:56.5235239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.5235580Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.5235861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.5236435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:56.5237008Z return fn(*args, **kwargs) 
2025-05-07T20:31:56.5875930Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
[the remaining OOM failures print the same allocator report; they are abbreviated below to the failing statement and the requested size]
2025-05-07T20:31:56.5890210Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 112.00 MiB at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
2025-05-07T20:31:56.5903812Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
2025-05-07T20:31:56.5923783Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 56.00 MiB at x_clamp)
2025-05-07T20:31:56.5937336Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 56.00 MiB at x_sign = torch.sign(x))
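Every allocator report above shows roughly 22 GiB already held on a 22.07 GiB device, so each new example fails on allocations as small as 40 to 56 MiB; the OOMs are cumulative fallout from earlier examples rather than one oversized tensor. A hedged sketch of per-example cleanup that could relieve the pressure, assuming the surrounding unittest-style TestCase; release_cuda_memory is an illustrative helper, not part of the original test file:

    import gc

    import torch


    def release_cuda_memory() -> None:
        # Drop dead Python references first so their blocks become reusable.
        gc.collect()
        # Wait for in-flight kernels before returning blocks to the driver.
        torch.cuda.synchronize()
        # Hand cached-but-unused blocks back so the next example can allocate.
        torch.cuda.empty_cache()


    # e.g. at the top of test_silu_mul_quant:
    #     self.addCleanup(release_cuda_memory)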
2025-05-07T20:31:56.7054018Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError ("type fp8e4nv not supported in this architecture"; same Triton traceback as above. Note that compiled=False fails identically, because silu_mul_quant launches _fbgemm_silu_mul_quant[grid] directly at activation.py:80.)
2025-05-07T20:31:56.7085695Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (fp8e4nv unsupported)
2025-05-07T20:31:56.7784274Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (fp8e4nv unsupported)
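For readers skimming the log, it may help to see what the op under test computes. Below is a rough eager-mode reference reconstructed only from the test's names and call signature; it is an assumption for illustration, not FBGEMM's actual implementation of silu_mul_quant:

    from typing import Optional, Tuple

    import torch


    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: y = SiLU(x0) * x1 in float32, then per-tensor
        # quantization to float8_e4m3fn with an optional upper bound on the
        # value used to derive the scale.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = (amax / fp8_max).clamp(min=1e-12)
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale

Read this way, the traceback pattern is consistent: the fp8 conversion has no pure-PyTorch fallback inside the op, so even compiled=False examples reach the Triton kernel and die in compilation on this GPU.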
2025-05-07T20:31:56.7817956Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 56.00 MiB at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
2025-05-07T20:31:56.8624731Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (fp8e4nv unsupported)
2025-05-07T20:31:56.8663490Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 40.00 MiB at x_sign = torch.sign(x))
2025-05-07T20:31:56.8676639Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 320.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9428040Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 80.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9440739Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9453437Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9466133Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB at x = torch.randn(...))
2025-05-07T20:31:56.9478729Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB at x = torch.randn(...))
2025-05-07T20:31:57.0534041Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB at x = torch.randn(...))
2025-05-07T20:31:57.0546829Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB at x = torch.randn(...))
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.0559114Z 2025-05-07T20:31:57.0559235Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.0559452Z 2025-05-07T20:31:57.0559558Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.0559980Z self=, 2025-05-07T20:31:57.0560398Z T=16384, 2025-05-07T20:31:57.0560592Z D=7168, 2025-05-07T20:31:57.0560779Z scale_ub=None, 2025-05-07T20:31:57.0560998Z contiguous=True, 2025-05-07T20:31:57.0561229Z compiled=False, 2025-05-07T20:31:57.0561444Z ) 2025-05-07T20:31:57.0561769Z self = 2025-05-07T20:31:57.0562284Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.0562569Z 2025-05-07T20:31:57.0562659Z @given( 2025-05-07T20:31:57.0562888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.0563211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.0563531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.0563877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.0564214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.0571434Z ) 2025-05-07T20:31:57.0571894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.0572359Z def test_silu_mul_quant( 2025-05-07T20:31:57.0572626Z self, 2025-05-07T20:31:57.0572825Z T: int, 2025-05-07T20:31:57.0573027Z D: int, 2025-05-07T20:31:57.0573256Z scale_ub: Optional[float], 2025-05-07T20:31:57.0573529Z contiguous: bool, 2025-05-07T20:31:57.0573772Z compiled: bool, 2025-05-07T20:31:57.0573998Z ) -> None: 2025-05-07T20:31:57.0574321Z torch.manual_seed(2025) 2025-05-07T20:31:57.0574574Z 2025-05-07T20:31:57.0574857Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.0576996Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.0578949Z 2025-05-07T20:31:57.0579076Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.0579295Z 2025-05-07T20:31:57.0579405Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.0579841Z self=, 2025-05-07T20:31:57.0580250Z T=16384, 2025-05-07T20:31:57.0580444Z D=7168, 2025-05-07T20:31:57.0580637Z scale_ub=1200.0, 2025-05-07T20:31:57.0580861Z contiguous=True, 2025-05-07T20:31:57.0581086Z compiled=False, 2025-05-07T20:31:57.0581294Z ) 2025-05-07T20:31:57.0581617Z self = 2025-05-07T20:31:57.0582130Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.0582418Z 2025-05-07T20:31:57.0582502Z @given( 2025-05-07T20:31:57.0582730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.0583051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.0583454Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.0583795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.0584129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.0584419Z ) 2025-05-07T20:31:57.0584787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.0585242Z def test_silu_mul_quant( 2025-05-07T20:31:57.0585488Z self, 2025-05-07T20:31:57.0585685Z T: int, 2025-05-07T20:31:57.0585883Z D: int, 2025-05-07T20:31:57.0586106Z scale_ub: Optional[float], 2025-05-07T20:31:57.0586385Z contiguous: bool, 2025-05-07T20:31:57.0586626Z compiled: bool, 2025-05-07T20:31:57.0586852Z ) -> None: 2025-05-07T20:31:57.0587075Z torch.manual_seed(2025) 2025-05-07T20:31:57.0587322Z 2025-05-07T20:31:57.0587600Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.0589740Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:31:57.0592194Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.1877475Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
[same test source elided]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
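Note: every OutOfMemoryError above is raised while setting up the example, not inside the kernel under test; each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 input while memory from earlier examples is still held by the caching allocator, so the free pool keeps shrinking. A minimal sketch of the two standard mitigations, assuming only public PyTorch APIs; the helper name and where it runs (e.g. setUp/tearDown) are illustrative, not part of this test suite:

    import gc
    import os

    # The allocator reads this at CUDA initialization, so it must be set
    # before the first allocation in the process (the log's own suggestion).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def free_cuda_between_examples() -> None:
        """Drop dead references, then return cached blocks to the driver."""
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()

    # Size check for the failing allocation: a [T, 2 * D] bf16 tensor at
    # T=16384, D=7168 is 16384 * 14336 * 2 bytes = 448 MiB, matching the
    # "Tried to allocate 448.00 MiB" traces above (112 MiB at T=4096,
    # 56 MiB at T=2048).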
2025-05-07T20:31:57.1890253Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test source elided; with compiled=True the call reaches the Triton kernel through torch._dynamo]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[same Triton jit.py/compiler.py chain as above, elided]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.2228843Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[same test source elided]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
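Note: the CompilationError is independent of the OOMs. Triton's fp8e4nv is the e4m3 float8 type these kernels quantize to, and this job's linux.g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), for which this Triton build only offers fp8e4b15 and fp8e5. A guard along the following lines would skip rather than fail the fp8 cases; the (8, 9) cutoff (Ada and newer) is an assumption about Triton's CUDA backend at the time of this log, and the class name is hypothetical:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # e4m3 ("fp8e4nv") conversions are assumed to need sm_89 or newer;
        # the A10G in this job reports (8, 6) and would be skipped.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(gpu_supports_fp8e4nv(), "Triton fp8e4nv needs sm_89+")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical, for illustration
        pass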
2025-05-07T20:31:57.2242283Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test source elided]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

2025-05-07T20:31:57.2255489Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
[same test source elided]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
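Note: the examples above are drawn from a fixed grid, so the same parameter combinations recur across runs. A small self-contained sketch of the sampling space; _MAX_SAMPLES is defined by the test suite and its value is not visible in this log:

    from itertools import product

    # The @given strategies draw from fixed lists; the full cross product is
    # 5 * 2 * 2 * 2 * 2 = 80 combinations, of which Hypothesis tries at most
    # max_examples (_MAX_SAMPLES here) per run.
    GRID = {
        "T": [1, 128, 2048, 4096, 16384],
        "D": [5120, 7168],
        "scale_ub": [None, 1200.00],
        "contiguous": [True, False],
        "compiled": [True, False],
    }
    print(len(list(product(*GRID.values()))))  # 80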
2025-05-07T20:31:57.5830314Z FAILED
2025-05-07T20:31:57.5830718Z =================================== FAILURES ===================================
2025-05-07T20:31:57.5831330Z _____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<...>,
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   [same torch.randn OutOfMemoryError traceback as sub-exception 1]
    | Falsifying example: test_silu_mul_quant(
    |     self=<...>,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   [same torch.randn OutOfMemoryError traceback as sub-exception 1]
    | Falsifying example: test_silu_mul_quant(
    |     self=<...>,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |              ^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=<...>,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
2025-05-07T20:31:57.5915787Z ---------------------------------- Hypothesis ----------------------------------
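Note: each falsifying example above comes with a @reproduce_failure blob that replays exactly that input locally. A minimal sketch of how failure 1 would be replayed; the function name is illustrative, the decorator must sit on top of the same @given stack as the original test, and the blob is only valid under the same Hypothesis version (6.131.14):

    from hypothesis import given, reproduce_failure, strategies as st

    # The blob is copied verbatim from failure 1 above (T=128, D=7168,
    # scale_ub=1200.0, contiguous=True, compiled=False).
    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant_repro(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # original test body from moe/activation_test.py goes here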
2025-05-07T20:31:57.5916173Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [same @given/@settings test source as above, elided down to the failing call]
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[same jit.py / autotuner.py / do_bench / compiler.py chain as sub-exception 4 above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
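Note: ref_fn above is the ground truth the kernel is checked against: SiLU(x0) * x1 in fp32, then rowwise fp8 quantization via triton_quantize_fp8_row, which is itself a Triton kernel and therefore hits the same fp8e4nv error on this GPU. A minimal eager sketch of that reference, assuming e4m3's max normal value of 448 and reading scale_ub as a clamp on the per-row max; this illustrates the contract, it is not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # The fp32 reference from the test's ref_fn: SiLU(x0) * x1.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32

    E4M3_MAX = 448.0  # largest normal value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One dequantization scale per row, so that
        # y ~= y_fp8.to(torch.float32) * scale[:, None], as in the test.
        row_max = y.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(y.dtype))
        scale = row_max / E4M3_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)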
2025-05-07T20:31:57.6000942Z x1 = x[:, D:] 2025-05-07T20:31:57.6001231Z 2025-05-07T20:31:57.6001500Z if contiguous: 2025-05-07T20:31:57.6001829Z x0 = x0.contiguous() 2025-05-07T20:31:57.6002197Z x1 = x1.contiguous() 2025-05-07T20:31:57.6002520Z 2025-05-07T20:31:57.6002786Z if scale_ub is not None: 2025-05-07T20:31:57.6003159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6003609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6004035Z ) 2025-05-07T20:31:57.6004299Z else: 2025-05-07T20:31:57.6004579Z scale_ub_tensor = None 2025-05-07T20:31:57.6004925Z 2025-05-07T20:31:57.6005244Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6005671Z op = silu_mul_quant 2025-05-07T20:31:57.6006009Z if compiled: 2025-05-07T20:31:57.6006618Z op = torch.compile(op) 2025-05-07T20:31:57.6007204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6007557Z 2025-05-07T20:31:57.6007801Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6008005Z 2025-05-07T20:31:57.6008138Z moe/activation_test.py:117: 2025-05-07T20:31:57.6008504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6008919Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6009283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6010193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6011150Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6011983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6012866Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6013719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6014422Z kernel = self.compile( 2025-05-07T20:31:57.6015128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6015984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6016490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6016790Z 2025-05-07T20:31:57.6017043Z self = 2025-05-07T20:31:57.6018532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6020458Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44857e9620>} 2025-05-07T20:31:57.6023989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6025398Z context = 2025-05-07T20:31:57.6025782Z 2025-05-07T20:31:57.6026014Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6026725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6027350Z module_map=module_map) 2025-05-07T20:31:57.6027833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6028317Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6028661Z E ^ 2025-05-07T20:31:57.6029295Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6029925Z 2025-05-07T20:31:57.6030494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6031201Z 2025-05-07T20:31:57.6031345Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6031891Z self=, 2025-05-07T20:31:57.6032435Z T=2048, 2025-05-07T20:31:57.6032685Z D=5120, 2025-05-07T20:31:57.6032933Z scale_ub=1200.0, 2025-05-07T20:31:57.6033232Z contiguous=True, 2025-05-07T20:31:57.6033529Z compiled=True, 2025-05-07T20:31:57.6033790Z ) 2025-05-07T20:31:57.6034215Z self = 2025-05-07T20:31:57.6034887Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6035265Z 2025-05-07T20:31:57.6035384Z @given( 2025-05-07T20:31:57.6035685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6036220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6036642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6037095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6037550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6037944Z ) 2025-05-07T20:31:57.6038420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6039030Z def test_silu_mul_quant( 2025-05-07T20:31:57.6039363Z self, 2025-05-07T20:31:57.6039626Z T: int, 2025-05-07T20:31:57.6039897Z D: int, 2025-05-07T20:31:57.6040185Z scale_ub: Optional[float], 2025-05-07T20:31:57.6040537Z contiguous: bool, 2025-05-07T20:31:57.6040861Z compiled: bool, 2025-05-07T20:31:57.6041152Z ) -> None: 2025-05-07T20:31:57.6041432Z torch.manual_seed(2025) 2025-05-07T20:31:57.6041758Z 2025-05-07T20:31:57.6042134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6042593Z 2025-05-07T20:31:57.6042847Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6043232Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6043651Z x = x_sign * x_clamp 2025-05-07T20:31:57.6043961Z x0 = x[:, :D] 2025-05-07T20:31:57.6044270Z x1 = x[:, D:] 2025-05-07T20:31:57.6044556Z 2025-05-07T20:31:57.6044795Z if contiguous: 2025-05-07T20:31:57.6045101Z x0 = x0.contiguous() 2025-05-07T20:31:57.6045454Z x1 = x1.contiguous() 2025-05-07T20:31:57.6045771Z 2025-05-07T20:31:57.6046030Z if scale_ub is not None: 2025-05-07T20:31:57.6046392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6046827Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6047241Z ) 2025-05-07T20:31:57.6047505Z else: 2025-05-07T20:31:57.6047777Z scale_ub_tensor = None 2025-05-07T20:31:57.6048118Z 2025-05-07T20:31:57.6048422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6048834Z op = silu_mul_quant 2025-05-07T20:31:57.6069694Z if compiled: 2025-05-07T20:31:57.6070215Z op = torch.compile(op) 2025-05-07T20:31:57.6070632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6071003Z 2025-05-07T20:31:57.6071266Z y_fp8, y_scale = fn() 2025-05-07T20:31:57.6071698Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:57.6072079Z 2025-05-07T20:31:57.6072394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6072855Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:57.6073258Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:57.6073675Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:57.6074154Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.6074576Z 2025-05-07T20:31:57.6074830Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:57.6075089Z 2025-05-07T20:31:57.6075223Z moe/activation_test.py:126: 2025-05-07T20:31:57.6075628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6076078Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:57.6076520Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.6077581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:57.6078617Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:57.6079370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6080312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6081273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:57.6082387Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:57.6083329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:57.6084180Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:57.6085015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:57.6085725Z fn() 2025-05-07T20:31:57.6086421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:57.6087222Z self.fn.run( 2025-05-07T20:31:57.6087857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6088588Z kernel = self.compile( 2025-05-07T20:31:57.6089345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6090249Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6090815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6091146Z 2025-05-07T20:31:57.6091428Z self = 2025-05-07T20:31:57.6092898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6094326Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448577e980>} 2025-05-07T20:31:57.6095717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6096773Z context = 2025-05-07T20:31:57.6097167Z 2025-05-07T20:31:57.6097347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6097923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6098408Z module_map=module_map) 2025-05-07T20:31:57.6098784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6099147Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:57.6099409Z E ^ 2025-05-07T20:31:57.6099880Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
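Note on the failure: fp8e4nv is Triton's name for the FP8 E4M3 format (torch.float8_e4m3fn), and Triton only emits it on NVIDIA GPUs with compute capability (8, 9) or newer; on older parts it exposes just fp8e4b15 and fp8e5, which is exactly the supported list quoted in the ValueError above. A minimal guard, using an illustrative helper name that is not part of the test suite, would skip these cases rather than fail the job:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; Triton needs an NVIDIA GPU with
        # compute capability >= (8, 9) (Ada/Hopper) to compile casts to it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for the FP8 tests in this log:
    skip_unless_fp8e4nv = unittest.skipUnless(
        gpu_supports_fp8e4nv(), "FP8 e4m3 (fp8e4nv) needs compute capability >= 8.9"
    )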
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
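The "Trying example" lines are produced by Hypothesis running at Verbosity.verbose: @given draws each parameter from the fixed st.sampled_from grids shown in the listing, and @settings caps how many examples are drawn. A stripped-down sketch of the same pattern, with _MAX_SAMPLES as a stand-in for the module's real value (not shown in this log):

    from hypothesis import Verbosity, given, settings, strategies as st

    _MAX_SAMPLES = 16  # assumption; the actual value lives in the test module

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_parameter_grid(T, D, scale_ub) -> None:
        # Verbosity.verbose prints each "Trying example: ..." line; deadline=None
        # avoids flaky timeouts while Triton autotunes kernels on first call.
        assert T in (1, 128, 2048, 4096, 16384) and D in (5120, 7168)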
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
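What the test checks, per the listing: fn() runs the fused FBGEMM kernel and returns (y_fp8, y_scale), which the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None], while ref_fn() computes the same activation eagerly, SiLU(x0) * x1, before row-wise quantization. The eager activation alone, as a self-contained sketch:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Mirrors ref_fn() above, minus the quantization step:
        # SiLU(x0) * x1 == x0 * sigmoid(x0) * x1, computed in fp32.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32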
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
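triton_quantize_fp8_row is what fails to compile here, but its contract is visible from how the test consumes its output: one FP8 tensor plus one fp32 scale per row, such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None], with scale_ub optionally capping the per-row maximum. An illustrative pure-PyTorch stand-in under those assumed semantics (not FBGEMM's implementation):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: per-row scale chosen so that
        # y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            # Assumption: scale_ub caps the row max used to derive the scale.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        y_scale = row_max / fp8_max
        # Guard all-zero rows against division by zero.
        y_scale = torch.where(y_scale == 0, torch.ones_like(y_scale), y_scale)
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale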
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
(test body identical to the listing above)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
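A pattern worth noting across these failures: every compiled=False example dies inside fn() as soon as silu_mul_quant launches _fbgemm_silu_mul_quant eagerly, while the compiled=True examples get through fn() and only fail later in the eager reference's _kernel_quantize_fp8_row, plausibly because the torch.compile-generated code takes a different path than the hand-written eager kernel. The toggle itself is a thin wrapper, sketched below:

    import torch

    def maybe_compile(op, compiled: bool):
        # torch.compile returns a wrapped callable; actual compilation happens
        # lazily on the first invocation, not at wrap time.
        return torch.compile(op) if compiled else op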
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
(test body identical to the listing above; fn() itself completes)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
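To isolate the error outside the test suite, the following sketch (not code from FBGEMM) should reproduce the same CompilationError under the same assumption: any @triton.jit kernel that casts to tl.float8e4nv on a GPU with compute capability below (8, 9).

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what pre-SM-8.9 backends reject at compile time.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)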
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: CompilationError (traceback identical to the one above, raised from _kernel_quantize_fp8_row via ref_fn)
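For readers without the FBGEMM sources handy, here is a rough pure-PyTorch sketch of what the reference path above computes. The row-wise quantization details (scale derivation, the role of scale_ub) are assumptions about triton_quantize_fp8_row's contract, not its actual implementation; quantize_fp8_row_ref and silu_mul_quant_ref are names invented for this sketch:

# Pure-PyTorch sketch of the test's reference path. Assumed semantics:
# triton_quantize_fp8_row rescales each row so its max |value| maps to the
# e4m3 representable max, optionally clamping the row max to scale_ub.
from typing import Optional, Tuple

import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
EPS = 1e-12  # guards the division for all-zero rows


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Assumed: scale_ub caps the per-row max used to derive the scale.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    scale = torch.clamp(row_max, min=EPS) / E4M3_MAX  # per-row dequant scale
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale


def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Mirrors ref_fn in the test: SiLU(x0) * x1 in fp32, then row-wise FP8,
    # so that y_fp8.to(torch.float32) * scale[:, None] recovers y approximately.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
    return quantize_fp8_row_ref(y, scale_ub)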
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError at moe/activation_test.py:126 (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError at moe/activation_test.py:126 (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row): same ValueError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> CompilationError at moe/activation_test.py:117 (fn -> torch._dynamo -> silu_mul_quant -> _fbgemm_silu_mul_quant): same ValueError

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> CompilationError at moe/activation_test.py:126 (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row): same ValueError
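Each "Trying example" entry above and below is a Hypothesis-generated case, and any of them can be pinned for a deterministic local repro. A minimal sketch of that workflow, assuming only that hypothesis is installed (the strategies are copied from the test, the body is elided, and test_repro_silu_mul_quant is a name invented here):

# Repro sketch, assumed workflow (not FBGEMM tooling): force Hypothesis to
# replay one failing parameter set before any randomly drawn examples.
from typing import Optional

import hypothesis.strategies as st
from hypothesis import example, given, settings


@example(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
@settings(max_examples=5, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_repro_silu_mul_quant(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    # Body elided; the real test calls silu_mul_quant and the
    # triton_quantize_fp8_row reference shown earlier in this log.
    assert T in (1, 128, 2048, 4096, 16384)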
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6405312Z 2025-05-07T20:31:57.6405744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6405825Z 2025-05-07T20:31:57.6405929Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6406419Z self=, 2025-05-07T20:31:57.6406546Z T=1, 2025-05-07T20:31:57.6406652Z D=5120, 2025-05-07T20:31:57.6406775Z scale_ub=None, 2025-05-07T20:31:57.6406869Z contiguous=True, 2025-05-07T20:31:57.6406955Z compiled=False, 2025-05-07T20:31:57.6407032Z ) 2025-05-07T20:31:57.6407257Z self = 2025-05-07T20:31:57.6407424Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6407435Z 2025-05-07T20:31:57.6407510Z @given( 2025-05-07T20:31:57.6407633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6407738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6407853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6407978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6408098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6408171Z ) 2025-05-07T20:31:57.6408427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6408526Z def test_silu_mul_quant( 2025-05-07T20:31:57.6408601Z self, 2025-05-07T20:31:57.6408678Z T: int, 2025-05-07T20:31:57.6408763Z D: int, 2025-05-07T20:31:57.6408861Z scale_ub: Optional[float], 2025-05-07T20:31:57.6408954Z contiguous: bool, 2025-05-07T20:31:57.6409038Z compiled: bool, 2025-05-07T20:31:57.6409117Z ) -> None: 2025-05-07T20:31:57.6409218Z torch.manual_seed(2025) 2025-05-07T20:31:57.6409292Z 2025-05-07T20:31:57.6409462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6409541Z 2025-05-07T20:31:57.6409634Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6409764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6409861Z x = x_sign * x_clamp 2025-05-07T20:31:57.6409941Z x0 = x[:, :D] 2025-05-07T20:31:57.6410019Z x1 = x[:, D:] 2025-05-07T20:31:57.6410304Z 2025-05-07T20:31:57.6410393Z if contiguous: 2025-05-07T20:31:57.6410494Z x0 = x0.contiguous() 2025-05-07T20:31:57.6410582Z x1 = x1.contiguous() 2025-05-07T20:31:57.6410654Z 2025-05-07T20:31:57.6410750Z if scale_ub is not None: 2025-05-07T20:31:57.6410854Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6410991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6411071Z ) 2025-05-07T20:31:57.6411147Z else: 2025-05-07T20:31:57.6411240Z scale_ub_tensor = None 2025-05-07T20:31:57.6411322Z 2025-05-07T20:31:57.6411451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6411540Z op = silu_mul_quant 2025-05-07T20:31:57.6411637Z if compiled: 2025-05-07T20:31:57.6411736Z op = torch.compile(op) 2025-05-07T20:31:57.6411960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6412044Z 2025-05-07T20:31:57.6412140Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6412144Z 2025-05-07T20:31:57.6412263Z moe/activation_test.py:117: 2025-05-07T20:31:57.6412413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6412528Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6412636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6413151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6413248Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6413626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6413985Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6414341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6414442Z kernel = self.compile( 2025-05-07T20:31:57.6414838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6415023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6415153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6415157Z 2025-05-07T20:31:57.6415372Z self = 2025-05-07T20:31:57.6416177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6416698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476765ee0>} 2025-05-07T20:31:57.6417484Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6417678Z context = 2025-05-07T20:31:57.6417683Z 2025-05-07T20:31:57.6417859Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6418130Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6418237Z module_map=module_map) 2025-05-07T20:31:57.6418408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6418509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6418597Z E ^ 2025-05-07T20:31:57.6418962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6418966Z 2025-05-07T20:31:57.6419498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6419503Z 2025-05-07T20:31:57.6419612Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6419841Z self=, 2025-05-07T20:31:57.6419926Z T=128, 2025-05-07T20:31:57.6420003Z D=5120, 2025-05-07T20:31:57.6420085Z scale_ub=None, 2025-05-07T20:31:57.6420176Z contiguous=False, 2025-05-07T20:31:57.6420259Z compiled=True, 2025-05-07T20:31:57.6420331Z ) 2025-05-07T20:31:57.6420562Z self = 2025-05-07T20:31:57.6420735Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.6420745Z 2025-05-07T20:31:57.6420821Z @given( 2025-05-07T20:31:57.6420946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6421051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6421166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6421288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6421402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6421500Z ) 2025-05-07T20:31:57.6421782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6421875Z def test_silu_mul_quant( 2025-05-07T20:31:57.6421956Z self, 2025-05-07T20:31:57.6422033Z T: int, 2025-05-07T20:31:57.6422107Z D: int, 2025-05-07T20:31:57.6422211Z scale_ub: Optional[float], 2025-05-07T20:31:57.6422299Z contiguous: bool, 2025-05-07T20:31:57.6422385Z compiled: bool, 2025-05-07T20:31:57.6422551Z ) -> None: 2025-05-07T20:31:57.6422647Z torch.manual_seed(2025) 2025-05-07T20:31:57.6422720Z 2025-05-07T20:31:57.6422901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6422979Z 2025-05-07T20:31:57.6423078Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6423202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6423291Z x = x_sign * x_clamp 2025-05-07T20:31:57.6423377Z x0 = x[:, :D] 2025-05-07T20:31:57.6423455Z x1 = x[:, D:] 2025-05-07T20:31:57.6423527Z 2025-05-07T20:31:57.6423615Z if contiguous: 2025-05-07T20:31:57.6423705Z x0 = x0.contiguous() 2025-05-07T20:31:57.6423795Z x1 = x1.contiguous() 2025-05-07T20:31:57.6423874Z 2025-05-07T20:31:57.6423964Z if scale_ub is not None: 2025-05-07T20:31:57.6424067Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6424212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6424294Z ) 2025-05-07T20:31:57.6424374Z else: 2025-05-07T20:31:57.6424467Z scale_ub_tensor = None 2025-05-07T20:31:57.6424538Z 2025-05-07T20:31:57.6424678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6424768Z op = silu_mul_quant 2025-05-07T20:31:57.6424853Z if compiled: 2025-05-07T20:31:57.6424957Z op = torch.compile(op) 2025-05-07T20:31:57.6425062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6425133Z 2025-05-07T20:31:57.6425233Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6425238Z 2025-05-07T20:31:57.6425335Z moe/activation_test.py:117: 2025-05-07T20:31:57.6425465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6425572Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6425672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6426056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6426155Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6426745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6426851Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6427219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6427452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6427803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6427896Z kernel = self.compile( 2025-05-07T20:31:57.6428296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6428474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6428610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6428614Z 2025-05-07T20:31:57.6428833Z self = 2025-05-07T20:31:57.6429634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6430157Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44767a5da0>} 2025-05-07T20:31:57.6430929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6431208Z context = 2025-05-07T20:31:57.6431212Z 2025-05-07T20:31:57.6431380Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6431657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6431770Z module_map=module_map) 2025-05-07T20:31:57.6431937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6432035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6432122Z E ^ 2025-05-07T20:31:57.6432485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6432490Z 2025-05-07T20:31:57.6432926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6432930Z 2025-05-07T20:31:57.6433034Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6433269Z self=, 2025-05-07T20:31:57.6433353Z T=128, 2025-05-07T20:31:57.6433428Z D=7168, 2025-05-07T20:31:57.6433512Z scale_ub=1200.0, 2025-05-07T20:31:57.6433611Z contiguous=False, 2025-05-07T20:31:57.6433697Z compiled=False, 2025-05-07T20:31:57.6433775Z ) 2025-05-07T20:31:57.6433997Z self = 2025-05-07T20:31:57.6434175Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.6434180Z 2025-05-07T20:31:57.6434262Z @given( 2025-05-07T20:31:57.6434385Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6434484Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6434605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6434722Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6434839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6434921Z ) 2025-05-07T20:31:57.6435171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6435269Z def test_silu_mul_quant( 2025-05-07T20:31:57.6435427Z self, 2025-05-07T20:31:57.6435506Z T: int, 2025-05-07T20:31:57.6435597Z D: int, 2025-05-07T20:31:57.6435695Z scale_ub: Optional[float], 2025-05-07T20:31:57.6435789Z contiguous: bool, 2025-05-07T20:31:57.6435876Z compiled: bool, 2025-05-07T20:31:57.6435954Z ) -> None: 2025-05-07T20:31:57.6436055Z torch.manual_seed(2025) 2025-05-07T20:31:57.6436128Z 2025-05-07T20:31:57.6436299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6436382Z 2025-05-07T20:31:57.6436477Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6436601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6436698Z x = x_sign * x_clamp 2025-05-07T20:31:57.6436783Z x0 = x[:, :D] 2025-05-07T20:31:57.6436870Z x1 = x[:, D:] 2025-05-07T20:31:57.6436942Z 2025-05-07T20:31:57.6437029Z if contiguous: 2025-05-07T20:31:57.6437126Z x0 = x0.contiguous() 2025-05-07T20:31:57.6437221Z x1 = x1.contiguous() 2025-05-07T20:31:57.6437294Z 2025-05-07T20:31:57.6437393Z if scale_ub is not None: 2025-05-07T20:31:57.6437499Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6437635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6437721Z ) 2025-05-07T20:31:57.6437797Z else: 2025-05-07T20:31:57.6437892Z scale_ub_tensor = None 2025-05-07T20:31:57.6437970Z 2025-05-07T20:31:57.6438102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6438192Z op = silu_mul_quant 2025-05-07T20:31:57.6438285Z if compiled: 2025-05-07T20:31:57.6438387Z op = torch.compile(op) 2025-05-07T20:31:57.6438583Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6438657Z 2025-05-07T20:31:57.6438750Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6438754Z 2025-05-07T20:31:57.6438859Z moe/activation_test.py:117: 2025-05-07T20:31:57.6438997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6439099Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6439203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6439714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6439818Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6440185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6440411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6440765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6440864Z kernel = self.compile( 2025-05-07T20:31:57.6441261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6441444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6441574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6441578Z 2025-05-07T20:31:57.6441790Z self = 2025-05-07T20:31:57.6442638Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6443155Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e7ab60>} 2025-05-07T20:31:57.6444019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6444212Z context = 2025-05-07T20:31:57.6444217Z 2025-05-07T20:31:57.6444390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6444661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6444778Z module_map=module_map) 2025-05-07T20:31:57.6444941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6445037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6445125Z E ^ 2025-05-07T20:31:57.6445490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6445499Z 2025-05-07T20:31:57.6445930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6445934Z 2025-05-07T20:31:57.6446044Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6446271Z self=, 2025-05-07T20:31:57.6446354Z T=128, 2025-05-07T20:31:57.6446431Z D=5120, 2025-05-07T20:31:57.6446513Z scale_ub=None, 2025-05-07T20:31:57.6446605Z contiguous=False, 2025-05-07T20:31:57.6446691Z compiled=False, 2025-05-07T20:31:57.6446762Z ) 2025-05-07T20:31:57.6446993Z self = 2025-05-07T20:31:57.6447167Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.6447171Z 2025-05-07T20:31:57.6447246Z @given( 2025-05-07T20:31:57.6447480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6447579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6447703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6447826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6447940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6448020Z ) 2025-05-07T20:31:57.6448270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6448362Z def test_silu_mul_quant( 2025-05-07T20:31:57.6448444Z self, 2025-05-07T20:31:57.6448519Z T: int, 2025-05-07T20:31:57.6448595Z D: int, 2025-05-07T20:31:57.6448704Z scale_ub: Optional[float], 2025-05-07T20:31:57.6448793Z contiguous: bool, 2025-05-07T20:31:57.6448877Z compiled: bool, 2025-05-07T20:31:57.6448962Z ) -> None: 2025-05-07T20:31:57.6449056Z torch.manual_seed(2025) 2025-05-07T20:31:57.6449136Z 2025-05-07T20:31:57.6449315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6449393Z 2025-05-07T20:31:57.6449491Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6449624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6449712Z x = x_sign * x_clamp 2025-05-07T20:31:57.6449799Z x0 = x[:, :D] 2025-05-07T20:31:57.6449880Z x1 = x[:, D:] 2025-05-07T20:31:57.6449951Z 2025-05-07T20:31:57.6450042Z if contiguous: 2025-05-07T20:31:57.6450134Z x0 = x0.contiguous() 2025-05-07T20:31:57.6450223Z x1 = x1.contiguous() 2025-05-07T20:31:57.6450301Z 2025-05-07T20:31:57.6450392Z if scale_ub is not None: 2025-05-07T20:31:57.6450504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6450641Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6450717Z ) 2025-05-07T20:31:57.6450799Z else: 2025-05-07T20:31:57.6450900Z scale_ub_tensor = None 2025-05-07T20:31:57.6450971Z 2025-05-07T20:31:57.6451106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6451196Z op = silu_mul_quant 2025-05-07T20:31:57.6451365Z if compiled: 2025-05-07T20:31:57.6451475Z op = torch.compile(op) 2025-05-07T20:31:57.6451585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6451675Z 2025-05-07T20:31:57.6451857Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6451863Z 2025-05-07T20:31:57.6451964Z moe/activation_test.py:117: 2025-05-07T20:31:57.6452102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6452204Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6452303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6452819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6452920Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6453288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6453528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6453878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6453981Z kernel = self.compile( 2025-05-07T20:31:57.6454377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6454558Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6454694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6454699Z 2025-05-07T20:31:57.6454904Z self = 2025-05-07T20:31:57.6455715Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6456320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e7b060>} 2025-05-07T20:31:57.6457091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6457287Z context = 2025-05-07T20:31:57.6457292Z 2025-05-07T20:31:57.6457460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6457736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6457845Z module_map=module_map) 2025-05-07T20:31:57.6458009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6458113Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6458194Z E ^ 2025-05-07T20:31:57.6458565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6458570Z 2025-05-07T20:31:57.6458996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6459000Z 2025-05-07T20:31:57.6459103Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6459337Z self=, 2025-05-07T20:31:57.6459414Z T=128, 2025-05-07T20:31:57.6459489Z D=5120, 2025-05-07T20:31:57.6459579Z scale_ub=1200.0, 2025-05-07T20:31:57.6459664Z contiguous=True, 2025-05-07T20:31:57.6459754Z compiled=False, 2025-05-07T20:31:57.6459832Z ) 2025-05-07T20:31:57.6460054Z self = 2025-05-07T20:31:57.6460235Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.6460318Z 2025-05-07T20:31:57.6460395Z @given( 2025-05-07T20:31:57.6460515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6460622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6460737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6460855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6460975Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6461048Z ) 2025-05-07T20:31:57.6461304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6461412Z def test_silu_mul_quant( 2025-05-07T20:31:57.6461497Z self, 2025-05-07T20:31:57.6461596Z T: int, 2025-05-07T20:31:57.6461688Z D: int, 2025-05-07T20:31:57.6461786Z scale_ub: Optional[float], 2025-05-07T20:31:57.6461881Z contiguous: bool, 2025-05-07T20:31:57.6461967Z compiled: bool, 2025-05-07T20:31:57.6462045Z ) -> None: 2025-05-07T20:31:57.6462151Z torch.manual_seed(2025) 2025-05-07T20:31:57.6462223Z 2025-05-07T20:31:57.6462396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6462476Z 2025-05-07T20:31:57.6462570Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6462699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6462791Z x = x_sign * x_clamp 2025-05-07T20:31:57.6462870Z x0 = x[:, :D] 2025-05-07T20:31:57.6462954Z x1 = x[:, D:] 2025-05-07T20:31:57.6463027Z 2025-05-07T20:31:57.6463110Z if contiguous: 2025-05-07T20:31:57.6463208Z x0 = x0.contiguous() 2025-05-07T20:31:57.6463297Z x1 = x1.contiguous() 2025-05-07T20:31:57.6463452Z 2025-05-07T20:31:57.6463550Z if scale_ub is not None: 2025-05-07T20:31:57.6463658Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6463794Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6463882Z ) 2025-05-07T20:31:57.6463958Z else: 2025-05-07T20:31:57.6464051Z scale_ub_tensor = None 2025-05-07T20:31:57.6464129Z 2025-05-07T20:31:57.6464258Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6464355Z op = silu_mul_quant 2025-05-07T20:31:57.6464439Z if compiled: 2025-05-07T20:31:57.6464538Z op = torch.compile(op) 2025-05-07T20:31:57.6464652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6464724Z 2025-05-07T20:31:57.6464815Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6464819Z 2025-05-07T20:31:57.6464924Z moe/activation_test.py:117: 2025-05-07T20:31:57.6465056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6465164Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6465269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6465787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6465891Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6466258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6466484Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6466840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6466935Z kernel = self.compile( 2025-05-07T20:31:57.6467337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6467515Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6467651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6467656Z 2025-05-07T20:31:57.6467951Z self = 2025-05-07T20:31:57.6468756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6469279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4475e78180>} 2025-05-07T20:31:57.6470050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6470247Z context = 2025-05-07T20:31:57.6470251Z 2025-05-07T20:31:57.6470427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6470702Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6470815Z module_map=module_map) 2025-05-07T20:31:57.6470981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6471081Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6471166Z E ^ 2025-05-07T20:31:57.6471530Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
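Every example in this run fails at the same point: the Triton kernels request the fp8e4nv (e4m3) dtype, which this GPU's architecture does not support; only fp8e4b15 and fp8e5 are available to it. A minimal sketch of a capability gate that would skip these cases up front, assuming unittest-style tests and assuming fp8e4nv needs compute capability (8, 9) or newer, which matches the error text here; the helper and class names are hypothetical, not taken from this log:

import unittest

import torch


def _device_supports_fp8e4nv() -> bool:
    # Assumption: Triton accepts fp8e4nv (e4m3) only on compute capability
    # >= (8, 9) (Ada / Hopper); older devices expose only fp8e4b15 / fp8e5,
    # which is exactly the ValueError this log repeats.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_device_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
class Fp8ActivationTests(unittest.TestCase):
    # The real test_silu_mul_quant body would live here unchanged; the
    # decorator alone would turn the repeated CompilationErrors below into
    # a single skip.
    pass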
Hypothesis then retried the call with ten more parameter sets. Each retry re-printed the identical test source and the identical traceback (return op(x0, x1, scale_ub_tensor) at moe/activation_test.py:115, _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, CompilationError reported from triton/compiler/compiler.py:100) and failed at y_fp8, y_scale = fn() (moe/activation_test.py:117) with the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), except where noted:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    (here fn() returned, and the identical ValueError was raised from the reference path instead: y_fp8_ref, y_scale_ref = ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row, fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 -> _kernel_quantize_fp8_row, via the Triton autotuner's do_bench)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
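For reference, ref_fn in the listing above computes y = x0 * sigmoid(x0) * x1 in fp32 and then quantizes it rowwise to fp8. A rough pure-PyTorch equivalent of that rowwise quantization, as a sketch only: the function name, the eps, and the clamp-to-scale_ub semantics are assumptions, not FBGEMM's actual triton_quantize_fp8_row:

import torch


def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row scale chosen so the row's max magnitude maps to the fp8 e4m3
    # max value; dequantization is y ~= y_fp8.to(torch.float32) * scale[:, None],
    # matching how the test consumes (y_fp8, y_scale).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        # Assumed semantics: scale_ub caps the per-row max before scaling.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale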
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6617018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
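Every rejection in this run is the same compile-time error: the Triton kernel behind silu_mul_quant requests the fp8e4nv encoding (float8_e4m3fn), which requires an SM 8.9+ GPU (Ada or Hopper), while the A10G backing this g5 runner is SM 8.6 and only exposes the fp8e4b15 and fp8e5 encodings named in the message. A minimal sketch of a capability guard that would skip these examples on such hardware follows; the helper name and skip placement are assumptions for illustration, not FBGEMM's actual gating.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) requires compute capability >= 8.9 (Ada/Hopper);
    # an A10G reports (8, 6), which is why Triton raises the ValueError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class GuardedActivationTest(unittest.TestCase):
    # Hypothetical guard, not FBGEMM's actual skip condition.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    def test_silu_mul_quant_guarded(self) -> None:
        # On capable hardware the fp8 path would run here; this only
        # asserts that the guard admitted a suitable device.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))


if __name__ == "__main__":
    unittest.main()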
[Hypothesis went on to try the ten examples below; each failed with the identical CompilationError, so the repeated test source and Triton traceback are elided.]
2025-05-07T20:31:57.6617128Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.6637138Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:57.6650219Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:57.6663773Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:57.6676740Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:57.6689729Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:57.6703485Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:57.6717137Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.6730578Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.6744198Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
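For orientation: judging by its name and the (y_fp8, y_scale) pair the test unpacks, silu_mul_quant fuses a SiLU gate with fp8 quantization. A hedged bf16 reference of the pre-quantization math, assumed from the op name rather than taken from FBGEMM's kernel:

import torch


def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1; the fused kernel additionally quantizes this product
    # to fp8 and returns a scale, optionally clamped by scale_ub.
    return torch.nn.functional.silu(x0) * x1


# Shapes mirror the test: x is [T, 2*D] and is split into two D-wide halves.
x = torch.randn([4, 16], dtype=torch.bfloat16)
y = silu_mul_reference(x[:, :8], x[:, 8:])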
2025-05-07T20:31:57.6750858Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.6751225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:57.6751457Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.6751887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.6751994Z     kernel = self.compile(
2025-05-07T20:31:57.6752435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.6752614Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.6752755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.6752760Z
2025-05-07T20:31:57.6752968Z self =
2025-05-07T20:31:57.6753771Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.6754297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44a9babd80>}
2025-05-07T20:31:57.6755072Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.6755274Z context =
2025-05-07T20:31:57.6755279Z
2025-05-07T20:31:57.6755445Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.6755716Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.6755830Z                            module_map=module_map)
2025-05-07T20:31:57.6755994Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.6756116Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.6756201Z E       ^
2025-05-07T20:31:57.6756567Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.6756571Z
2025-05-07T20:31:57.6757090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.6757095Z
2025-05-07T20:31:57.6757200Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:57.6763824Z     self=,
2025-05-07T20:31:57.6763922Z     T=16384,
2025-05-07T20:31:57.6764012Z     D=5120,
2025-05-07T20:31:57.6764097Z     scale_ub=1200.0,
2025-05-07T20:31:57.6764182Z     contiguous=True,
2025-05-07T20:31:57.6764273Z     compiled=True,
2025-05-07T20:31:57.6764347Z )
2025-05-07T20:31:57.6764589Z self =
2025-05-07T20:31:57.6764772Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:57.6764789Z
2025-05-07T20:31:57.6764871Z     @given(
2025-05-07T20:31:57.6765002Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:57.6765104Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:57.6765227Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:57.6765357Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:57.6765478Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:57.6765564Z     )
2025-05-07T20:31:57.6765825Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:57.6765922Z     def test_silu_mul_quant(
2025-05-07T20:31:57.6766009Z         self,
2025-05-07T20:31:57.6766091Z         T: int,
2025-05-07T20:31:57.6766170Z         D: int,
2025-05-07T20:31:57.6766279Z         scale_ub: Optional[float],
2025-05-07T20:31:57.6766373Z         contiguous: bool,
2025-05-07T20:31:57.6766460Z         compiled: bool,
2025-05-07T20:31:57.6766550Z     ) -> None:
2025-05-07T20:31:57.6766768Z         torch.manual_seed(2025)
2025-05-07T20:31:57.6766847Z
2025-05-07T20:31:57.6767035Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:57.6767110Z
2025-05-07T20:31:57.6767211Z         x_sign = torch.sign(x)
2025-05-07T20:31:57.6767349Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:57.6767441Z         x = x_sign * x_clamp
2025-05-07T20:31:57.6767533Z         x0 = x[:, :D]
2025-05-07T20:31:57.6767615Z         x1 = x[:, D:]
2025-05-07T20:31:57.6767691Z
2025-05-07T20:31:57.6767789Z         if contiguous:
2025-05-07T20:31:57.6767884Z             x0 = x0.contiguous()
2025-05-07T20:31:57.6767980Z             x1 = x1.contiguous()
2025-05-07T20:31:57.6768067Z
2025-05-07T20:31:57.6768161Z         if scale_ub is not None:
2025-05-07T20:31:57.6768269Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:57.6768414Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:57.6768496Z             )
2025-05-07T20:31:57.6768579Z         else:
2025-05-07T20:31:57.6768676Z             scale_ub_tensor = None
2025-05-07T20:31:57.6768748Z
2025-05-07T20:31:57.6768889Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.6768989Z             op = silu_mul_quant
2025-05-07T20:31:57.6769076Z             if compiled:
2025-05-07T20:31:57.6769185Z                 op = torch.compile(op)
2025-05-07T20:31:57.6769295Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.6769368Z
2025-05-07T20:31:57.6769469Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.6769474Z
2025-05-07T20:31:57.6769577Z moe/activation_test.py:117:
2025-05-07T20:31:57.6769721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.6769823Z moe/activation_test.py:115: in fn
2025-05-07T20:31:57.6769928Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.6770323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:57.6770422Z     return fn(*args, **kwargs)
2025-05-07T20:31:57.6771020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.6771124Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.6771496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:57.6771733Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.6772204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.6772300Z     kernel = self.compile(
2025-05-07T20:31:57.6772706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.6772888Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.6773031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.6773035Z
2025-05-07T20:31:57.6773246Z self =
2025-05-07T20:31:57.6774069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.6774601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4484d772e0>}
2025-05-07T20:31:57.6775385Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.6775587Z context =
2025-05-07T20:31:57.6775672Z
2025-05-07T20:31:57.6775843Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.6776125Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.6776239Z                            module_map=module_map)
2025-05-07T20:31:57.6776405Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.6776509Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.6776586Z E       ^
2025-05-07T20:31:57.6776957Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.6776962Z
2025-05-07T20:31:57.6777402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
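Diagnosis: every example fails at the same point, while Triton is lowering _fbgemm_silu_mul_quant, before the kernel ever launches. The kernel requests the fp8e4nv element type (Triton's name for the float8_e4m3fn encoding), but the GPU driving this job only exposes the fp8e4b15 and fp8e5 encodings; fp8e4nv requires an NVIDIA device with compute capability 8.9 or newer. Below is a minimal sketch, not FBGEMM's actual code, of a capability guard that would skip such tests instead of failing them on older parts; the helper cuda_supports_fp8e4nv and the class name ActivationTests are assumed names for illustration.

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (torch.float8_e4m3fn) is only
    # lowered on NVIDIA GPUs with compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class ActivationTests(unittest.TestCase):  # class name assumed
    @unittest.skipIf(
        not cuda_supports_fp8e4nv(),
        "fp8e4nv needs SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # Hypothesis-driven body as shown in the log above

Guarded this way, the fp8 cases would be reported as skips on sm_86-class runners rather than hard CompilationErrors, while still running on SM 8.9+ CI hardware.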
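For reference while reading the test: judging from how the test drives it, silu_mul_quant fuses a SwiGLU-style activation with FP8 quantization, returning the quantized tensor together with a dequantization scale. The eager sketch below is inferred from that usage only; in particular the rowwise scaling and the way scale_ub caps the per-row maximum are assumptions, not the FBGEMM kernel's documented behavior.

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in float32, standing in for the fused Triton kernel.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Assumed rowwise scheme: one scale per row of the [T, D] output.
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = row_max / FP8_MAX  # dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Note that the plain .to(torch.float8_e4m3fn) cast is done by PyTorch itself and should work even on GPUs where Triton cannot compile fp8e4nv kernels, so a reference like this can serve as a comparison point on the failing hardware.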
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6776962Z 2025-05-07T20:31:57.6777402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6777406Z 2025-05-07T20:31:57.6777509Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6777744Z self=, 2025-05-07T20:31:57.6777826Z T=16384, 2025-05-07T20:31:57.6777901Z D=5120, 2025-05-07T20:31:57.6777988Z scale_ub=None, 2025-05-07T20:31:57.6778074Z contiguous=False, 2025-05-07T20:31:57.6778160Z compiled=True, 2025-05-07T20:31:57.6778241Z ) 2025-05-07T20:31:57.6778466Z self = 2025-05-07T20:31:57.6778647Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.6778651Z 2025-05-07T20:31:57.6778733Z @given( 2025-05-07T20:31:57.6778852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6778956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6779073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6779191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6779311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6779390Z ) 2025-05-07T20:31:57.6779643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6779742Z def test_silu_mul_quant( 2025-05-07T20:31:57.6779819Z self, 2025-05-07T20:31:57.6780011Z T: int, 2025-05-07T20:31:57.6780095Z D: int, 2025-05-07T20:31:57.6780192Z scale_ub: Optional[float], 2025-05-07T20:31:57.6780281Z contiguous: bool, 2025-05-07T20:31:57.6780372Z compiled: bool, 2025-05-07T20:31:57.6780450Z ) -> None: 2025-05-07T20:31:57.6780549Z torch.manual_seed(2025) 2025-05-07T20:31:57.6780628Z 2025-05-07T20:31:57.6780798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6780876Z 2025-05-07T20:31:57.6780966Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6781092Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6781184Z x = x_sign * x_clamp 2025-05-07T20:31:57.6781268Z x0 = x[:, :D] 2025-05-07T20:31:57.6781354Z x1 = x[:, D:] 2025-05-07T20:31:57.6781430Z 2025-05-07T20:31:57.6781514Z if contiguous: 2025-05-07T20:31:57.6781605Z x0 = x0.contiguous() 2025-05-07T20:31:57.6781699Z x1 = x1.contiguous() 2025-05-07T20:31:57.6781777Z 2025-05-07T20:31:57.6781867Z if scale_ub is not None: 2025-05-07T20:31:57.6781977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6782115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6782195Z ) 2025-05-07T20:31:57.6782270Z else: 2025-05-07T20:31:57.6782363Z scale_ub_tensor = None 2025-05-07T20:31:57.6782440Z 2025-05-07T20:31:57.6782570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6782660Z op = silu_mul_quant 2025-05-07T20:31:57.6782752Z if compiled: 2025-05-07T20:31:57.6782852Z op = torch.compile(op) 2025-05-07T20:31:57.6782958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6783122Z 2025-05-07T20:31:57.6783214Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6783218Z 2025-05-07T20:31:57.6783322Z moe/activation_test.py:117: 2025-05-07T20:31:57.6783460Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6783563Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6783673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6784052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6784146Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6784666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6784763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6785141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6785375Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6785731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6785837Z kernel = self.compile( 2025-05-07T20:31:57.6786234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6786414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6786557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6786561Z 2025-05-07T20:31:57.6786770Z self = 2025-05-07T20:31:57.6787590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6788118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608f920>} 2025-05-07T20:31:57.6788983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6789180Z context = 2025-05-07T20:31:57.6789185Z 2025-05-07T20:31:57.6789356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6789637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6789745Z module_map=module_map) 2025-05-07T20:31:57.6789918Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6790023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6790101Z E ^ 2025-05-07T20:31:57.6790479Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6790488Z 2025-05-07T20:31:57.6790919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6790924Z 2025-05-07T20:31:57.6791029Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6791266Z self=, 2025-05-07T20:31:57.6791343Z T=2048, 2025-05-07T20:31:57.6791426Z D=5120, 2025-05-07T20:31:57.6791508Z scale_ub=None, 2025-05-07T20:31:57.6791592Z contiguous=False, 2025-05-07T20:31:57.6791681Z compiled=True, 2025-05-07T20:31:57.6791752Z ) 2025-05-07T20:31:57.6791979Z self = 2025-05-07T20:31:57.6792165Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.6792247Z 2025-05-07T20:31:57.6792323Z @given( 2025-05-07T20:31:57.6792452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6792581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6792721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6792845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6792960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6793031Z ) 2025-05-07T20:31:57.6793292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6793386Z def test_silu_mul_quant( 2025-05-07T20:31:57.6793460Z self, 2025-05-07T20:31:57.6793543Z T: int, 2025-05-07T20:31:57.6793620Z D: int, 2025-05-07T20:31:57.6793721Z scale_ub: Optional[float], 2025-05-07T20:31:57.6793818Z contiguous: bool, 2025-05-07T20:31:57.6793902Z compiled: bool, 2025-05-07T20:31:57.6793987Z ) -> None: 2025-05-07T20:31:57.6794087Z torch.manual_seed(2025) 2025-05-07T20:31:57.6794159Z 2025-05-07T20:31:57.6794340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6794418Z 2025-05-07T20:31:57.6794510Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6794641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6794729Z x = x_sign * x_clamp 2025-05-07T20:31:57.6794808Z x0 = x[:, :D] 2025-05-07T20:31:57.6794894Z x1 = x[:, D:] 2025-05-07T20:31:57.6794967Z 2025-05-07T20:31:57.6795051Z if contiguous: 2025-05-07T20:31:57.6795149Z x0 = x0.contiguous() 2025-05-07T20:31:57.6795238Z x1 = x1.contiguous() 2025-05-07T20:31:57.6795309Z 2025-05-07T20:31:57.6795403Z if scale_ub is not None: 2025-05-07T20:31:57.6795509Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6795653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6795732Z ) 2025-05-07T20:31:57.6795811Z else: 2025-05-07T20:31:57.6795909Z scale_ub_tensor = None 2025-05-07T20:31:57.6795981Z 2025-05-07T20:31:57.6796198Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6796298Z op = silu_mul_quant 2025-05-07T20:31:57.6796383Z if compiled: 2025-05-07T20:31:57.6796484Z op = torch.compile(op) 2025-05-07T20:31:57.6796597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6796669Z 2025-05-07T20:31:57.6796760Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6796764Z 2025-05-07T20:31:57.6796867Z moe/activation_test.py:117: 2025-05-07T20:31:57.6797000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6797105Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6797208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6797586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6797691Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6798208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6798305Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6798682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6798912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6799272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6799366Z kernel = self.compile( 2025-05-07T20:31:57.6799762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6800035Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6800166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6800171Z 2025-05-07T20:31:57.6800391Z self = 2025-05-07T20:31:57.6801202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6801721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f447608df80>} 2025-05-07T20:31:57.6802540Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6802757Z context = 2025-05-07T20:31:57.6802762Z 2025-05-07T20:31:57.6802937Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6803214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6803322Z module_map=module_map) 2025-05-07T20:31:57.6803495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6803593Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6803678Z E ^ 2025-05-07T20:31:57.6804045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6804050Z 2025-05-07T20:31:57.6804480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6804485Z 2025-05-07T20:31:57.6804598Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6804828Z self=, 2025-05-07T20:31:57.6804908Z T=2048, 2025-05-07T20:31:57.6804990Z D=5120, 2025-05-07T20:31:57.6805156Z scale_ub=1200.0, 2025-05-07T20:31:57.6805252Z contiguous=False, 2025-05-07T20:31:57.6805336Z compiled=True, 2025-05-07T20:31:57.6805409Z ) 2025-05-07T20:31:57.6805641Z self = 2025-05-07T20:31:57.6805826Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6805830Z 2025-05-07T20:31:57.6805906Z @given( 2025-05-07T20:31:57.6806034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6806137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6807132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6807284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6807416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6807497Z ) 2025-05-07T20:31:57.6807757Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6807849Z def test_silu_mul_quant( 2025-05-07T20:31:57.6807938Z self, 2025-05-07T20:31:57.6808017Z T: int, 2025-05-07T20:31:57.6808093Z D: int, 2025-05-07T20:31:57.6808199Z scale_ub: Optional[float], 2025-05-07T20:31:57.6808289Z contiguous: bool, 2025-05-07T20:31:57.6808376Z compiled: bool, 2025-05-07T20:31:57.6808467Z ) -> None: 2025-05-07T20:31:57.6808561Z torch.manual_seed(2025) 2025-05-07T20:31:57.6808632Z 2025-05-07T20:31:57.6808809Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6808883Z 2025-05-07T20:31:57.6808980Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6809105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6809194Z x = x_sign * x_clamp 2025-05-07T20:31:57.6809592Z x0 = x[:, :D] 2025-05-07T20:31:57.6809672Z x1 = x[:, D:] 2025-05-07T20:31:57.6809744Z 2025-05-07T20:31:57.6809834Z if contiguous: 2025-05-07T20:31:57.6809931Z x0 = x0.contiguous() 2025-05-07T20:31:57.6810021Z x1 = x1.contiguous() 2025-05-07T20:31:57.6810101Z 2025-05-07T20:31:57.6810191Z if scale_ub is not None: 2025-05-07T20:31:57.6810298Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6810441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6810516Z ) 2025-05-07T20:31:57.6810591Z else: 2025-05-07T20:31:57.6810692Z scale_ub_tensor = None 2025-05-07T20:31:57.6810765Z 2025-05-07T20:31:57.6810902Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6810993Z op = silu_mul_quant 2025-05-07T20:31:57.6811076Z if compiled: 2025-05-07T20:31:57.6811183Z op = torch.compile(op) 2025-05-07T20:31:57.6811295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6811367Z 2025-05-07T20:31:57.6811466Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6811471Z 2025-05-07T20:31:57.6811575Z moe/activation_test.py:117: 2025-05-07T20:31:57.6811708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6811916Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6812019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6812407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6812502Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6813015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6813120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6813490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6813725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6814234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6814330Z kernel = self.compile( 2025-05-07T20:31:57.6814731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6814909Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6815040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6815045Z 2025-05-07T20:31:57.6815258Z self = 2025-05-07T20:31:57.6816062Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6816595Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fdca40>} 2025-05-07T20:31:57.6817371Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6817568Z context = 2025-05-07T20:31:57.6817573Z 2025-05-07T20:31:57.6817740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6818010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6818122Z module_map=module_map) 2025-05-07T20:31:57.6818286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6818498Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6818583Z E ^ 2025-05-07T20:31:57.6818953Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6818958Z 2025-05-07T20:31:57.6819393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6819398Z 2025-05-07T20:31:57.6819501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6819732Z self=, 2025-05-07T20:31:57.6819815Z T=4096, 2025-05-07T20:31:57.6819892Z D=5120, 2025-05-07T20:31:57.6819976Z scale_ub=1200.0, 2025-05-07T20:31:57.6820072Z contiguous=True, 2025-05-07T20:31:57.6820154Z compiled=True, 2025-05-07T20:31:57.6820239Z ) 2025-05-07T20:31:57.6820462Z self = 2025-05-07T20:31:57.6820647Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6820652Z 2025-05-07T20:31:57.6820736Z @given( 2025-05-07T20:31:57.6820862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6820961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6821084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6821201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6821316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6821398Z ) 2025-05-07T20:31:57.6821651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6821750Z def test_silu_mul_quant( 2025-05-07T20:31:57.6821828Z self, 2025-05-07T20:31:57.6821906Z T: int, 2025-05-07T20:31:57.6821992Z D: int, 2025-05-07T20:31:57.6822091Z scale_ub: Optional[float], 2025-05-07T20:31:57.6822186Z contiguous: bool, 2025-05-07T20:31:57.6822278Z compiled: bool, 2025-05-07T20:31:57.6822361Z ) -> None: 2025-05-07T20:31:57.6822455Z torch.manual_seed(2025) 2025-05-07T20:31:57.6822538Z 2025-05-07T20:31:57.6822851Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6822928Z 2025-05-07T20:31:57.6823028Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6823153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6823248Z x = x_sign * x_clamp 2025-05-07T20:31:57.6823328Z x0 = x[:, :D] 2025-05-07T20:31:57.6823407Z x1 = x[:, D:] 2025-05-07T20:31:57.6823483Z 2025-05-07T20:31:57.6823566Z if contiguous: 2025-05-07T20:31:57.6823657Z x0 = x0.contiguous() 2025-05-07T20:31:57.6823754Z x1 = x1.contiguous() 2025-05-07T20:31:57.6823826Z 2025-05-07T20:31:57.6823936Z if scale_ub is not None: 2025-05-07T20:31:57.6824041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6824182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6824265Z ) 2025-05-07T20:31:57.6824341Z else: 2025-05-07T20:31:57.6824440Z scale_ub_tensor = None 2025-05-07T20:31:57.6824521Z 2025-05-07T20:31:57.6824652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6824743Z op = silu_mul_quant 2025-05-07T20:31:57.6824835Z if compiled: 2025-05-07T20:31:57.6824934Z op = torch.compile(op) 2025-05-07T20:31:57.6825051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6825123Z 2025-05-07T20:31:57.6825217Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6825221Z 2025-05-07T20:31:57.6825325Z moe/activation_test.py:117: 2025-05-07T20:31:57.6825457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6825558Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6825748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6826126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6826219Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6826744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6826840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6827212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6827439Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6827790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6827890Z kernel = self.compile( 2025-05-07T20:31:57.6828285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6828473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6828603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6828613Z 2025-05-07T20:31:57.6828819Z self = 2025-05-07T20:31:57.6829631Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6830147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4476fde2a0>} 2025-05-07T20:31:57.6830929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6831128Z context = 2025-05-07T20:31:57.6831133Z 2025-05-07T20:31:57.6831377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6831665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6831792Z module_map=module_map) 2025-05-07T20:31:57.6831986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6832085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6832161Z E ^ 2025-05-07T20:31:57.6832537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6832542Z 2025-05-07T20:31:57.6832974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6832983Z 2025-05-07T20:31:57.6833092Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6833323Z self=, 2025-05-07T20:31:57.6833399Z T=128, 2025-05-07T20:31:57.6833485Z D=5120, 2025-05-07T20:31:57.6833569Z scale_ub=1200.0, 2025-05-07T20:31:57.6833653Z contiguous=False, 2025-05-07T20:31:57.6833741Z compiled=True, 2025-05-07T20:31:57.6833813Z ) 2025-05-07T20:31:57.6834038Z self = 2025-05-07T20:31:57.6834220Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6834226Z 2025-05-07T20:31:57.6834302Z @given( 2025-05-07T20:31:57.6834428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6834528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6834644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6834767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6834963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6835037Z ) 2025-05-07T20:31:57.6835301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6835395Z def test_silu_mul_quant( 2025-05-07T20:31:57.6835472Z self, 2025-05-07T20:31:57.6835553Z T: int, 2025-05-07T20:31:57.6835629Z D: int, 2025-05-07T20:31:57.6835734Z scale_ub: Optional[float], 2025-05-07T20:31:57.6835823Z contiguous: bool, 2025-05-07T20:31:57.6835909Z compiled: bool, 2025-05-07T20:31:57.6835994Z ) -> None: 2025-05-07T20:31:57.6836088Z torch.manual_seed(2025) 2025-05-07T20:31:57.6836160Z 2025-05-07T20:31:57.6836336Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6836410Z 2025-05-07T20:31:57.6836502Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6836633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6836730Z x = x_sign * x_clamp 2025-05-07T20:31:57.6836810Z x0 = x[:, :D] 2025-05-07T20:31:57.6836896Z x1 = x[:, D:] 2025-05-07T20:31:57.6836969Z 2025-05-07T20:31:57.6837056Z if contiguous: 2025-05-07T20:31:57.6837155Z x0 = x0.contiguous() 2025-05-07T20:31:57.6837244Z x1 = x1.contiguous() 2025-05-07T20:31:57.6837323Z 2025-05-07T20:31:57.6837413Z if scale_ub is not None: 2025-05-07T20:31:57.6837518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6837665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6837740Z ) 2025-05-07T20:31:57.6837818Z else: 2025-05-07T20:31:57.6837919Z scale_ub_tensor = None 2025-05-07T20:31:57.6837990Z 2025-05-07T20:31:57.6838121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6838221Z op = silu_mul_quant 2025-05-07T20:31:57.6838306Z if compiled: 2025-05-07T20:31:57.6838410Z op = torch.compile(op) 2025-05-07T20:31:57.6838522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6838593Z 2025-05-07T20:31:57.6838692Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6838777Z 2025-05-07T20:31:57.6838878Z moe/activation_test.py:117: 2025-05-07T20:31:57.6839009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6839117Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6839217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6839597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6839696Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6840216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6840321Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6840697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6840927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6841293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6841390Z kernel = self.compile( 2025-05-07T20:31:57.6841791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6841976Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6842105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6842110Z 2025-05-07T20:31:57.6842325Z self = 2025-05-07T20:31:57.6843146Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6843760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44779540e0>} 2025-05-07T20:31:57.6844541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6844733Z context = 2025-05-07T20:31:57.6844738Z 2025-05-07T20:31:57.6844912Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6845187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6845302Z module_map=module_map) 2025-05-07T20:31:57.6845472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6845572Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6845659Z E ^ 2025-05-07T20:31:57.6846030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6846034Z 2025-05-07T20:31:57.6846462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6846475Z 2025-05-07T20:31:57.6846577Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6846807Z self=, 2025-05-07T20:31:57.6846892Z T=16384, 2025-05-07T20:31:57.6846969Z D=7168, 2025-05-07T20:31:57.6847052Z scale_ub=1200.0, 2025-05-07T20:31:57.6847144Z contiguous=True, 2025-05-07T20:31:57.6847226Z compiled=True, 2025-05-07T20:31:57.6847304Z ) 2025-05-07T20:31:57.6847536Z self = 2025-05-07T20:31:57.6847716Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6847721Z 2025-05-07T20:31:57.6847906Z @given( 2025-05-07T20:31:57.6848034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6848133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6848253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6848369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6848482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6848561Z ) 2025-05-07T20:31:57.6848812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6848905Z def test_silu_mul_quant( 2025-05-07T20:31:57.6848988Z self, 2025-05-07T20:31:57.6849065Z T: int, 2025-05-07T20:31:57.6849141Z D: int, 2025-05-07T20:31:57.6849249Z scale_ub: Optional[float], 2025-05-07T20:31:57.6849337Z contiguous: bool, 2025-05-07T20:31:57.6849423Z compiled: bool, 2025-05-07T20:31:57.6849508Z ) -> None: 2025-05-07T20:31:57.6849606Z torch.manual_seed(2025) 2025-05-07T20:31:57.6849684Z 2025-05-07T20:31:57.6849852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6849924Z 2025-05-07T20:31:57.6850022Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6850146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6850234Z x = x_sign * x_clamp 2025-05-07T20:31:57.6850320Z x0 = x[:, :D] 2025-05-07T20:31:57.6850399Z x1 = x[:, D:] 2025-05-07T20:31:57.6850472Z 2025-05-07T20:31:57.6850563Z if contiguous: 2025-05-07T20:31:57.6850655Z x0 = x0.contiguous() 2025-05-07T20:31:57.6850744Z x1 = x1.contiguous() 2025-05-07T20:31:57.6850821Z 2025-05-07T20:31:57.6850911Z if scale_ub is not None: 2025-05-07T20:31:57.6851103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6851239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6851312Z ) 2025-05-07T20:31:57.6851398Z else: 2025-05-07T20:31:57.6851492Z scale_ub_tensor = None 2025-05-07T20:31:57.6851564Z 2025-05-07T20:31:57.6851698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6851877Z op = silu_mul_quant 2025-05-07T20:31:57.6851964Z if compiled: 2025-05-07T20:31:57.6852071Z op = torch.compile(op) 2025-05-07T20:31:57.6852175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6852246Z 2025-05-07T20:31:57.6852341Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6852346Z 2025-05-07T20:31:57.6852443Z moe/activation_test.py:117: 2025-05-07T20:31:57.6852580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6852690Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6852789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6853170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6853267Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6853777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6853881Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6854248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6854483Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6854835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6854930Z kernel = self.compile( 2025-05-07T20:31:57.6855337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6855517Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6855734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6855747Z 2025-05-07T20:31:57.6855955Z self = 2025-05-07T20:31:57.6856763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6857284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477956160>} 2025-05-07T20:31:57.6858057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6858262Z context = 2025-05-07T20:31:57.6858270Z 2025-05-07T20:31:57.6858437Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6858708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6858821Z module_map=module_map) 2025-05-07T20:31:57.6858989Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6859096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6859176Z E ^ 2025-05-07T20:31:57.6859541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6859545Z 2025-05-07T20:31:57.6859977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6860059Z 2025-05-07T20:31:57.6860165Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6860405Z self=, 2025-05-07T20:31:57.6860484Z T=16384, 2025-05-07T20:31:57.6860563Z D=5120, 2025-05-07T20:31:57.6860655Z scale_ub=1200.0, 2025-05-07T20:31:57.6860741Z contiguous=True, 2025-05-07T20:31:57.6860828Z compiled=False, 2025-05-07T20:31:57.6860908Z ) 2025-05-07T20:31:57.6861132Z self = 2025-05-07T20:31:57.6861316Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.6861320Z 2025-05-07T20:31:57.6861406Z @given( 2025-05-07T20:31:57.6861526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6861627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6861750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6861876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6862001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6862079Z ) 2025-05-07T20:31:57.6862337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6862440Z def test_silu_mul_quant( 2025-05-07T20:31:57.6862518Z self, 2025-05-07T20:31:57.6862596Z T: int, 2025-05-07T20:31:57.6862679Z D: int, 2025-05-07T20:31:57.6862777Z scale_ub: Optional[float], 2025-05-07T20:31:57.6862867Z contiguous: bool, 2025-05-07T20:31:57.6862960Z compiled: bool, 2025-05-07T20:31:57.6863040Z ) -> None: 2025-05-07T20:31:57.6863135Z torch.manual_seed(2025) 2025-05-07T20:31:57.6863217Z 2025-05-07T20:31:57.6863387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6863467Z 2025-05-07T20:31:57.6863560Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6863688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6863783Z x = x_sign * x_clamp 2025-05-07T20:31:57.6863865Z x0 = x[:, :D] 2025-05-07T20:31:57.6864028Z x1 = x[:, D:] 2025-05-07T20:31:57.6864112Z 2025-05-07T20:31:57.6864199Z if contiguous: 2025-05-07T20:31:57.6864292Z x0 = x0.contiguous() 2025-05-07T20:31:57.6864389Z x1 = x1.contiguous() 2025-05-07T20:31:57.6864462Z 2025-05-07T20:31:57.6864552Z if scale_ub is not None: 2025-05-07T20:31:57.6864666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6864802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6864883Z ) 2025-05-07T20:31:57.6864961Z else: 2025-05-07T20:31:57.6865055Z scale_ub_tensor = None 2025-05-07T20:31:57.6865134Z 2025-05-07T20:31:57.6865265Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6865362Z op = silu_mul_quant 2025-05-07T20:31:57.6865454Z if compiled: 2025-05-07T20:31:57.6865555Z op = torch.compile(op) 2025-05-07T20:31:57.6865664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6865749Z 2025-05-07T20:31:57.6865844Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6865849Z 2025-05-07T20:31:57.6865946Z moe/activation_test.py:117: 2025-05-07T20:31:57.6866084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6866186Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6866292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6866806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:57.6866908Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6867284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6867591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6867957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6868054Z kernel = self.compile( 2025-05-07T20:31:57.6868451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6868638Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6868771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6868777Z 2025-05-07T20:31:57.6868988Z self = 2025-05-07T20:31:57.6869796Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6870315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477847d80>} 2025-05-07T20:31:57.6871100Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6871295Z context = 2025-05-07T20:31:57.6871300Z 2025-05-07T20:31:57.6871473Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6871744Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6871851Z module_map=module_map) 2025-05-07T20:31:57.6872023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6872129Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6872208Z E ^ 2025-05-07T20:31:57.6872579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6872660Z 2025-05-07T20:31:57.6873090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6873095Z 2025-05-07T20:31:57.6873208Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6873437Z self=, 2025-05-07T20:31:57.6873516Z T=1, 2025-05-07T20:31:57.6873601Z D=7168, 2025-05-07T20:31:57.6873685Z scale_ub=1200.0, 2025-05-07T20:31:57.6873775Z contiguous=False, 2025-05-07T20:31:57.6873867Z compiled=False, 2025-05-07T20:31:57.6873941Z ) 2025-05-07T20:31:57.6874173Z self = 2025-05-07T20:31:57.6874350Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.6874354Z 2025-05-07T20:31:57.6874433Z @given( 2025-05-07T20:31:57.6874560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6874665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6874782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6874912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6875027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6875104Z ) 2025-05-07T20:31:57.6875364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6875459Z def test_silu_mul_quant( 2025-05-07T20:31:57.6875544Z self, 2025-05-07T20:31:57.6875623Z T: int, 2025-05-07T20:31:57.6875701Z D: int, 2025-05-07T20:31:57.6875806Z scale_ub: Optional[float], 2025-05-07T20:31:57.6875897Z contiguous: bool, 2025-05-07T20:31:57.6876067Z compiled: bool, 2025-05-07T20:31:57.6876157Z ) -> None: 2025-05-07T20:31:57.6876253Z torch.manual_seed(2025) 2025-05-07T20:31:57.6876328Z 2025-05-07T20:31:57.6876510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6876586Z 2025-05-07T20:31:57.6876678Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6876810Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6876899Z x = x_sign * x_clamp 2025-05-07T20:31:57.6876984Z x0 = x[:, :D] 2025-05-07T20:31:57.6877066Z x1 = x[:, D:] 2025-05-07T20:31:57.6877139Z 2025-05-07T20:31:57.6877229Z if contiguous: 2025-05-07T20:31:57.6877321Z x0 = x0.contiguous() 2025-05-07T20:31:57.6877412Z x1 = x1.contiguous() 2025-05-07T20:31:57.6877492Z 2025-05-07T20:31:57.6877584Z if scale_ub is not None: 2025-05-07T20:31:57.6877690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6877832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6877915Z ) 2025-05-07T20:31:57.6877992Z else: 2025-05-07T20:31:57.6878094Z scale_ub_tensor = None 2025-05-07T20:31:57.6878168Z 2025-05-07T20:31:57.6878303Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6878400Z op = silu_mul_quant 2025-05-07T20:31:57.6878486Z if compiled: 2025-05-07T20:31:57.6878590Z op = torch.compile(op) 2025-05-07T20:31:57.6878696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6878769Z 2025-05-07T20:31:57.6878867Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6878871Z 2025-05-07T20:31:57.6878968Z moe/activation_test.py:117: 2025-05-07T20:31:57.6879100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6879210Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6879310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6879828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6879934Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6880410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6880648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6880999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6881095Z kernel = self.compile( 2025-05-07T20:31:57.6881494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6881671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6881806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6881816Z 2025-05-07T20:31:57.6882025Z self = 2025-05-07T20:31:57.6882836Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6883359Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4477e7fce0>} 2025-05-07T20:31:57.6884133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6884333Z context = 2025-05-07T20:31:57.6884337Z 2025-05-07T20:31:57.6884505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6884852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6884965Z module_map=module_map) 2025-05-07T20:31:57.6885135Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6885241Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6885319Z E ^ 2025-05-07T20:31:57.6885681Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6885686Z 2025-05-07T20:31:57.6886118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6886122Z 2025-05-07T20:31:57.6886227Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6886463Z self=, 2025-05-07T20:31:57.6886541Z T=4096, 2025-05-07T20:31:57.6886622Z D=7168, 2025-05-07T20:31:57.6886713Z scale_ub=1200.0, 2025-05-07T20:31:57.6886801Z contiguous=False, 2025-05-07T20:31:57.6886885Z compiled=True, 2025-05-07T20:31:57.6886979Z ) 2025-05-07T20:31:57.6887207Z self = 2025-05-07T20:31:57.6887392Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6887396Z 2025-05-07T20:31:57.6887482Z @given( 2025-05-07T20:31:57.6887603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6894566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6894708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6894837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6894952Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6895034Z ) 2025-05-07T20:31:57.6895294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6895399Z def test_silu_mul_quant( 2025-05-07T20:31:57.6895483Z self, 2025-05-07T20:31:57.6895561Z T: int, 2025-05-07T20:31:57.6895640Z D: int, 2025-05-07T20:31:57.6895864Z scale_ub: Optional[float], 2025-05-07T20:31:57.6895957Z contiguous: bool, 2025-05-07T20:31:57.6896043Z compiled: bool, 2025-05-07T20:31:57.6896133Z ) -> None: 2025-05-07T20:31:57.6896229Z torch.manual_seed(2025) 2025-05-07T20:31:57.6896304Z 2025-05-07T20:31:57.6896488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6896567Z 2025-05-07T20:31:57.6896669Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6896797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6896887Z x = x_sign * x_clamp 2025-05-07T20:31:57.6896977Z x0 = x[:, :D] 2025-05-07T20:31:57.6897057Z x1 = x[:, D:] 2025-05-07T20:31:57.6897130Z 2025-05-07T20:31:57.6897227Z if contiguous: 2025-05-07T20:31:57.6897320Z x0 = x0.contiguous() 2025-05-07T20:31:57.6897411Z x1 = x1.contiguous() 2025-05-07T20:31:57.6897495Z 2025-05-07T20:31:57.6897586Z if scale_ub is not None: 2025-05-07T20:31:57.6897698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6897846Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6897923Z ) 2025-05-07T20:31:57.6898007Z else: 2025-05-07T20:31:57.6898103Z scale_ub_tensor = None 2025-05-07T20:31:57.6898177Z 2025-05-07T20:31:57.6898321Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6898412Z op = silu_mul_quant 2025-05-07T20:31:57.6898498Z if compiled: 2025-05-07T20:31:57.6898606Z op = torch.compile(op) 2025-05-07T20:31:57.6898714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6898787Z 2025-05-07T20:31:57.6898882Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6898971Z 2025-05-07T20:31:57.6899072Z moe/activation_test.py:117: 2025-05-07T20:31:57.6899212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6899319Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6899420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6899813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6899907Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6900426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6900523Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6900893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6901129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6901483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6901579Z kernel = self.compile( 2025-05-07T20:31:57.6901987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6902193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6902355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6902360Z 2025-05-07T20:31:57.6902570Z self = 2025-05-07T20:31:57.6903378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6903904Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44c6ff5f80>} 2025-05-07T20:31:57.6904758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6904962Z context = 2025-05-07T20:31:57.6904967Z 2025-05-07T20:31:57.6905136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6905414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6905523Z module_map=module_map) 2025-05-07T20:31:57.6905688Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6905793Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6905870Z E ^ 2025-05-07T20:31:57.6906605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6906614Z 2025-05-07T20:31:57.6907110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6907116Z 2025-05-07T20:31:57.6907221Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6907459Z self=, 2025-05-07T20:31:57.6907536Z T=128, 2025-05-07T20:31:57.6907615Z D=7168, 2025-05-07T20:31:57.6907704Z scale_ub=1200.0, 2025-05-07T20:31:57.6907790Z contiguous=False, 2025-05-07T20:31:57.6907873Z compiled=True, 2025-05-07T20:31:57.6907955Z ) 2025-05-07T20:31:57.6908179Z self = 2025-05-07T20:31:57.6908355Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:57.6908360Z 2025-05-07T20:31:57.6908679Z @given( 2025-05-07T20:31:57.6908799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6908904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6909019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6909142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6909262Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6909335Z ) 2025-05-07T20:31:57.6909586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6909684Z def test_silu_mul_quant( 2025-05-07T20:31:57.6909760Z self, 2025-05-07T20:31:57.6909839Z T: int, 2025-05-07T20:31:57.6909921Z D: int, 2025-05-07T20:31:57.6910018Z scale_ub: Optional[float], 2025-05-07T20:31:57.6910111Z contiguous: bool, 2025-05-07T20:31:57.6910196Z compiled: bool, 2025-05-07T20:31:57.6910273Z ) -> None: 2025-05-07T20:31:57.6910374Z torch.manual_seed(2025) 2025-05-07T20:31:57.6910451Z 2025-05-07T20:31:57.6910623Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6910704Z 2025-05-07T20:31:57.6910795Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6910928Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6911023Z x = x_sign * x_clamp 2025-05-07T20:31:57.6911102Z x0 = x[:, :D] 2025-05-07T20:31:57.6911181Z x1 = x[:, D:] 2025-05-07T20:31:57.6911262Z 2025-05-07T20:31:57.6911347Z if contiguous: 2025-05-07T20:31:57.6911438Z x0 = x0.contiguous() 2025-05-07T20:31:57.6911534Z x1 = x1.contiguous() 2025-05-07T20:31:57.6911607Z 2025-05-07T20:31:57.6911703Z if scale_ub is not None: 2025-05-07T20:31:57.6911806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6911942Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6912021Z ) 2025-05-07T20:31:57.6912103Z else: 2025-05-07T20:31:57.6912196Z scale_ub_tensor = None 2025-05-07T20:31:57.6912273Z 2025-05-07T20:31:57.6912403Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6912493Z op = silu_mul_quant 2025-05-07T20:31:57.6912717Z if compiled: 2025-05-07T20:31:57.6912819Z op = torch.compile(op) 2025-05-07T20:31:57.6912926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6913004Z 2025-05-07T20:31:57.6913094Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6913098Z 2025-05-07T20:31:57.6913200Z moe/activation_test.py:117: 2025-05-07T20:31:57.6913335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6913435Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6913540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6913917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.6914016Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.6914537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6914639Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6915012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6915239Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6915589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6915688Z kernel = self.compile( 2025-05-07T20:31:57.6916084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6916262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6916403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6916486Z 2025-05-07T20:31:57.6916695Z self = 2025-05-07T20:31:57.6917518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6918034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f448572d9e0>} 2025-05-07T20:31:57.6918816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6919010Z context = 2025-05-07T20:31:57.6919020Z 2025-05-07T20:31:57.6919188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6919466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6919577Z module_map=module_map) 2025-05-07T20:31:57.6919747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6919847Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6919925Z E ^ 2025-05-07T20:31:57.6920296Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.6920839Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported); source listing and traceback are identical to the example above.
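Both failures above share one root cause: Triton refuses to lower the fp8e4nv (float8_e4m3fn) element type on this runner's GPU. The job ran on a linux.g5.4xlarge.nvidia.gpu instance, whose A10G reports compute capability 8.6, while Triton (depending on version) only emits fp8e4nv on capability 8.9 or newer (Ada/Hopper); on 8.6 it offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability guard that could skip these examples on unsupported hardware follows; the helper name, the threshold, and the skipIf placement are illustrative assumptions, not taken from activation_test.py:

import unittest

import torch


def triton_supports_fp8e4nv() -> bool:
    # Assumption: Triton can lower fp8e4nv (float8_e4m3fn) only on compute
    # capability >= (8, 9), i.e. Ada or Hopper. The A10G on a g5.4xlarge
    # reports (8, 6), which is why the kernel above fails to compile there.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test class:
@unittest.skipIf(not triton_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantFp8Tests(unittest.TestCase):
    ...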
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6933905Z 2025-05-07T20:31:57.6934342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6934346Z 2025-05-07T20:31:57.6934456Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6934685Z self=, 2025-05-07T20:31:57.6934770Z T=16384, 2025-05-07T20:31:57.6934847Z D=5120, 2025-05-07T20:31:57.6934928Z scale_ub=None, 2025-05-07T20:31:57.6935021Z contiguous=False, 2025-05-07T20:31:57.6935105Z compiled=False, 2025-05-07T20:31:57.6935178Z ) 2025-05-07T20:31:57.6935412Z self = 2025-05-07T20:31:57.6935594Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.6935605Z 2025-05-07T20:31:57.6935682Z @given( 2025-05-07T20:31:57.6935807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6935907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6936026Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6936150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6936267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6936347Z ) 2025-05-07T20:31:57.6936600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6936692Z def test_silu_mul_quant( 2025-05-07T20:31:57.6936776Z self, 2025-05-07T20:31:57.6936853Z T: int, 2025-05-07T20:31:57.6936930Z D: int, 2025-05-07T20:31:57.6937037Z scale_ub: Optional[float], 2025-05-07T20:31:57.6937127Z contiguous: bool, 2025-05-07T20:31:57.6937215Z compiled: bool, 2025-05-07T20:31:57.6937303Z ) -> None: 2025-05-07T20:31:57.6937398Z torch.manual_seed(2025) 2025-05-07T20:31:57.6937470Z 2025-05-07T20:31:57.6937649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6937722Z 2025-05-07T20:31:57.6937899Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6938025Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6939915Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6939932Z 2025-05-07T20:31:57.6940053Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.6940057Z 2025-05-07T20:31:57.6940159Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6940398Z self=, 2025-05-07T20:31:57.6940475Z T=4096, 2025-05-07T20:31:57.6940552Z D=7168, 2025-05-07T20:31:57.6940640Z scale_ub=1200.0, 2025-05-07T20:31:57.6940725Z contiguous=True, 2025-05-07T20:31:57.6940809Z compiled=True, 2025-05-07T20:31:57.6940887Z ) 2025-05-07T20:31:57.6941111Z self = 2025-05-07T20:31:57.6941291Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6941295Z 2025-05-07T20:31:57.6941372Z @given( 2025-05-07T20:31:57.6941491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6941593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6941706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6941934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6942055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6942127Z ) 2025-05-07T20:31:57.6942404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6942509Z def test_silu_mul_quant( 2025-05-07T20:31:57.6942598Z self, 2025-05-07T20:31:57.6942690Z T: int, 2025-05-07T20:31:57.6942767Z D: int, 2025-05-07T20:31:57.6942864Z scale_ub: Optional[float], 2025-05-07T20:31:57.6942960Z contiguous: bool, 2025-05-07T20:31:57.6943045Z compiled: bool, 2025-05-07T20:31:57.6943122Z ) -> None: 2025-05-07T20:31:57.6943223Z torch.manual_seed(2025) 2025-05-07T20:31:57.6943297Z 2025-05-07T20:31:57.6943465Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6943545Z 2025-05-07T20:31:57.6943635Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6943767Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6945639Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6945646Z 2025-05-07T20:31:57.6945770Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.6945774Z 2025-05-07T20:31:57.6945876Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6946103Z self=, 2025-05-07T20:31:57.6946192Z T=16384, 2025-05-07T20:31:57.6946267Z D=7168, 2025-05-07T20:31:57.6946349Z scale_ub=None, 2025-05-07T20:31:57.6946439Z contiguous=False, 2025-05-07T20:31:57.6946521Z compiled=False, 2025-05-07T20:31:57.6946595Z ) 2025-05-07T20:31:57.6946902Z self = 2025-05-07T20:31:57.6947083Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.6947087Z 2025-05-07T20:31:57.6947169Z @given( 2025-05-07T20:31:57.6947289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6947386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6947507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6947623Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6947737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6947816Z ) 2025-05-07T20:31:57.6948068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6948166Z def test_silu_mul_quant( 2025-05-07T20:31:57.6948248Z self, 2025-05-07T20:31:57.6948326Z T: int, 2025-05-07T20:31:57.6948410Z D: int, 2025-05-07T20:31:57.6948513Z scale_ub: Optional[float], 2025-05-07T20:31:57.6948603Z contiguous: bool, 2025-05-07T20:31:57.6948695Z compiled: bool, 2025-05-07T20:31:57.6948774Z ) -> None: 2025-05-07T20:31:57.6948869Z torch.manual_seed(2025) 2025-05-07T20:31:57.6948948Z 2025-05-07T20:31:57.6949115Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6950980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6951075Z 2025-05-07T20:31:57.6951194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.6951199Z 2025-05-07T20:31:57.6951301Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6951533Z self=, 2025-05-07T20:31:57.6951610Z T=2048, 2025-05-07T20:31:57.6951693Z D=7168, 2025-05-07T20:31:57.6951778Z scale_ub=1200.0, 2025-05-07T20:31:57.6951866Z contiguous=True, 2025-05-07T20:31:57.6951955Z compiled=True, 2025-05-07T20:31:57.6952028Z ) 2025-05-07T20:31:57.6952251Z self = 2025-05-07T20:31:57.6952432Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.6952441Z 2025-05-07T20:31:57.6952520Z @given( 2025-05-07T20:31:57.6952639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6952746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6952864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6952989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6953103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6953179Z ) 2025-05-07T20:31:57.6953440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6953534Z def test_silu_mul_quant( 2025-05-07T20:31:57.6953610Z self, 2025-05-07T20:31:57.6953695Z T: int, 2025-05-07T20:31:57.6953770Z D: int, 2025-05-07T20:31:57.6953867Z scale_ub: Optional[float], 2025-05-07T20:31:57.6953966Z contiguous: bool, 2025-05-07T20:31:57.6954051Z compiled: bool, 2025-05-07T20:31:57.6954128Z ) -> None: 2025-05-07T20:31:57.6954232Z torch.manual_seed(2025) 2025-05-07T20:31:57.6954304Z 2025-05-07T20:31:57.6954471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6954550Z 2025-05-07T20:31:57.6954755Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6954888Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6956736Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6956742Z 2025-05-07T20:31:57.6956871Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.6956876Z 2025-05-07T20:31:57.6956995Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6957228Z self=, 2025-05-07T20:31:57.6957312Z T=2048, 2025-05-07T20:31:57.6957390Z D=7168, 2025-05-07T20:31:57.6957481Z scale_ub=None, 2025-05-07T20:31:57.6957567Z contiguous=True, 2025-05-07T20:31:57.6957651Z compiled=False, 2025-05-07T20:31:57.6957733Z ) 2025-05-07T20:31:57.6957954Z self = 2025-05-07T20:31:57.6958131Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6958141Z 2025-05-07T20:31:57.6958218Z @given( 2025-05-07T20:31:57.6958338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6958444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6958557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6958762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6958883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6958957Z ) 2025-05-07T20:31:57.6959213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6959314Z def test_silu_mul_quant( 2025-05-07T20:31:57.6959390Z self, 2025-05-07T20:31:57.6959467Z T: int, 2025-05-07T20:31:57.6959550Z D: int, 2025-05-07T20:31:57.6959647Z scale_ub: Optional[float], 2025-05-07T20:31:57.6959743Z contiguous: bool, 2025-05-07T20:31:57.6959829Z compiled: bool, 2025-05-07T20:31:57.6959906Z ) -> None: 2025-05-07T20:31:57.6960008Z torch.manual_seed(2025) 2025-05-07T20:31:57.6960082Z 2025-05-07T20:31:57.6960249Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6960332Z 2025-05-07T20:31:57.6960425Z > x_sign = torch.sign(x) 2025-05-07T20:31:57.6962291Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.6962297Z 2025-05-07T20:31:57.6962416Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:57.6962421Z 2025-05-07T20:31:57.6962523Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6962756Z self=, 2025-05-07T20:31:57.6962832Z T=1, 2025-05-07T20:31:57.6962917Z D=7168, 2025-05-07T20:31:57.6963002Z scale_ub=1200.0, 2025-05-07T20:31:57.6963093Z contiguous=True, 2025-05-07T20:31:57.6963184Z compiled=False, 2025-05-07T20:31:57.6963255Z ) 2025-05-07T20:31:57.6963477Z self = 2025-05-07T20:31:57.6963736Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.6963741Z 2025-05-07T20:31:57.6963819Z @given( 2025-05-07T20:31:57.6963938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6964042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6964157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6964278Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6964392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6964467Z ) 2025-05-07T20:31:57.6964722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6964814Z def test_silu_mul_quant( 2025-05-07T20:31:57.6964895Z self, 2025-05-07T20:31:57.6964977Z T: int, 2025-05-07T20:31:57.6965051Z D: int, 2025-05-07T20:31:57.6965147Z scale_ub: Optional[float], 2025-05-07T20:31:57.6965242Z contiguous: bool, 2025-05-07T20:31:57.6965332Z compiled: bool, 2025-05-07T20:31:57.6965409Z ) -> None: 2025-05-07T20:31:57.6965510Z torch.manual_seed(2025) 2025-05-07T20:31:57.6965586Z 2025-05-07T20:31:57.6965760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6965833Z 2025-05-07T20:31:57.6965924Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6966057Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6966146Z x = x_sign * x_clamp 2025-05-07T20:31:57.6966226Z x0 = x[:, :D] 2025-05-07T20:31:57.6966313Z x1 = x[:, D:] 2025-05-07T20:31:57.6966385Z 2025-05-07T20:31:57.6966470Z if contiguous: 2025-05-07T20:31:57.6966567Z x0 = x0.contiguous() 2025-05-07T20:31:57.6966740Z x1 = x1.contiguous() 2025-05-07T20:31:57.6966812Z 2025-05-07T20:31:57.6966909Z if scale_ub is not None: 2025-05-07T20:31:57.6967014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6967155Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6967237Z ) 2025-05-07T20:31:57.6967314Z else: 2025-05-07T20:31:57.6967414Z scale_ub_tensor = None 2025-05-07T20:31:57.6967489Z 2025-05-07T20:31:57.6967621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6967718Z op = silu_mul_quant 2025-05-07T20:31:57.6967803Z if compiled: 2025-05-07T20:31:57.6967902Z op = torch.compile(op) 2025-05-07T20:31:57.6968017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6968090Z 2025-05-07T20:31:57.6968181Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6968185Z 2025-05-07T20:31:57.6968290Z moe/activation_test.py:117: 2025-05-07T20:31:57.6968427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6968537Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6968638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6969161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6969265Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6969635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6969862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6970222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6970317Z kernel = self.compile( 2025-05-07T20:31:57.6970720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6970903Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6971034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6971121Z 2025-05-07T20:31:57.6971338Z self = 2025-05-07T20:31:57.6972293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6972819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474e5cea0>} 2025-05-07T20:31:57.6973596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6973793Z context = 2025-05-07T20:31:57.6973806Z 2025-05-07T20:31:57.6973981Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6974252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6974368Z module_map=module_map) 2025-05-07T20:31:57.6974533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6974632Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6974715Z E ^ 2025-05-07T20:31:57.6975082Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6975087Z 2025-05-07T20:31:57.6975520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6975605Z 2025-05-07T20:31:57.6975710Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6975937Z self=, 2025-05-07T20:31:57.6976022Z T=128, 2025-05-07T20:31:57.6976105Z D=5120, 2025-05-07T20:31:57.6976188Z scale_ub=None, 2025-05-07T20:31:57.6976280Z contiguous=True, 2025-05-07T20:31:57.6976365Z compiled=False, 2025-05-07T20:31:57.6976440Z ) 2025-05-07T20:31:57.6976671Z self = 2025-05-07T20:31:57.6976846Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6976851Z 2025-05-07T20:31:57.6976934Z @given( 2025-05-07T20:31:57.6977053Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6977154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6977275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6977393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6977512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6977592Z ) 2025-05-07T20:31:57.6977849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6977948Z def test_silu_mul_quant( 2025-05-07T20:31:57.6978023Z self, 2025-05-07T20:31:57.6978099Z T: int, 2025-05-07T20:31:57.6978181Z D: int, 2025-05-07T20:31:57.6978279Z scale_ub: Optional[float], 2025-05-07T20:31:57.6978369Z contiguous: bool, 2025-05-07T20:31:57.6978460Z compiled: bool, 2025-05-07T20:31:57.6978537Z ) -> None: 2025-05-07T20:31:57.6978630Z torch.manual_seed(2025) 2025-05-07T20:31:57.6978712Z 2025-05-07T20:31:57.6978883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6978957Z 2025-05-07T20:31:57.6979055Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6979179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6979273Z x = x_sign * x_clamp 2025-05-07T20:31:57.6979359Z x0 = x[:, :D] 2025-05-07T20:31:57.6979438Z x1 = x[:, D:] 2025-05-07T20:31:57.6979515Z 2025-05-07T20:31:57.6979681Z if contiguous: 2025-05-07T20:31:57.6979776Z x0 = x0.contiguous() 2025-05-07T20:31:57.6979874Z x1 = x1.contiguous() 2025-05-07T20:31:57.6979946Z 2025-05-07T20:31:57.6980035Z if scale_ub is not None: 2025-05-07T20:31:57.6980144Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6980281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6980355Z ) 2025-05-07T20:31:57.6980438Z else: 2025-05-07T20:31:57.6980532Z scale_ub_tensor = None 2025-05-07T20:31:57.6980604Z 2025-05-07T20:31:57.6980741Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6980832Z op = silu_mul_quant 2025-05-07T20:31:57.6980921Z if compiled: 2025-05-07T20:31:57.6981027Z op = torch.compile(op) 2025-05-07T20:31:57.6981131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6981210Z 2025-05-07T20:31:57.6981299Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6981311Z 2025-05-07T20:31:57.6981409Z moe/activation_test.py:117: 2025-05-07T20:31:57.6981544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6981644Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6981745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6982299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6982411Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6982785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6983011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6983443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6983542Z kernel = self.compile( 2025-05-07T20:31:57.6983942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6984120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6984258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6984263Z 2025-05-07T20:31:57.6984470Z self = 2025-05-07T20:31:57.6985281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6985806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474e5df80>} 2025-05-07T20:31:57.6986591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6986783Z context = 2025-05-07T20:31:57.6986788Z 2025-05-07T20:31:57.6986957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.6987236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.6987342Z module_map=module_map) 2025-05-07T20:31:57.6987516Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.6987616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.6987698Z E ^ 2025-05-07T20:31:57.6988070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.6988075Z 2025-05-07T20:31:57.6988604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.6988609Z 2025-05-07T20:31:57.6988722Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.6988951Z self=, 2025-05-07T20:31:57.6989029Z T=128, 2025-05-07T20:31:57.6989110Z D=7168, 2025-05-07T20:31:57.6989193Z scale_ub=None, 2025-05-07T20:31:57.6989279Z contiguous=True, 2025-05-07T20:31:57.6989370Z compiled=False, 2025-05-07T20:31:57.6989445Z ) 2025-05-07T20:31:57.6989669Z self = 2025-05-07T20:31:57.6989849Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.6989858Z 2025-05-07T20:31:57.6989935Z @given( 2025-05-07T20:31:57.6990054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.6990160Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.6990284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.6990407Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.6990523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.6990597Z ) 2025-05-07T20:31:57.6990857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.6990949Z def test_silu_mul_quant( 2025-05-07T20:31:57.6991025Z self, 2025-05-07T20:31:57.6991108Z T: int, 2025-05-07T20:31:57.6991184Z D: int, 2025-05-07T20:31:57.6991281Z scale_ub: Optional[float], 2025-05-07T20:31:57.6991380Z contiguous: bool, 2025-05-07T20:31:57.6991480Z compiled: bool, 2025-05-07T20:31:57.6991576Z ) -> None: 2025-05-07T20:31:57.6991775Z torch.manual_seed(2025) 2025-05-07T20:31:57.6991847Z 2025-05-07T20:31:57.6992028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.6992102Z 2025-05-07T20:31:57.6992198Z x_sign = torch.sign(x) 2025-05-07T20:31:57.6992329Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.6992416Z x = x_sign * x_clamp 2025-05-07T20:31:57.6992496Z x0 = x[:, :D] 2025-05-07T20:31:57.6992582Z x1 = x[:, D:] 2025-05-07T20:31:57.6992654Z 2025-05-07T20:31:57.6992739Z if contiguous: 2025-05-07T20:31:57.6992836Z x0 = x0.contiguous() 2025-05-07T20:31:57.6992924Z x1 = x1.contiguous() 2025-05-07T20:31:57.6992994Z 2025-05-07T20:31:57.6993089Z if scale_ub is not None: 2025-05-07T20:31:57.6993193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.6993335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.6993418Z ) 2025-05-07T20:31:57.6993494Z else: 2025-05-07T20:31:57.6993594Z scale_ub_tensor = None 2025-05-07T20:31:57.6993665Z 2025-05-07T20:31:57.6993794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.6993894Z op = silu_mul_quant 2025-05-07T20:31:57.6993979Z if compiled: 2025-05-07T20:31:57.6994077Z op = torch.compile(op) 2025-05-07T20:31:57.6994188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6994260Z 2025-05-07T20:31:57.6994351Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.6994362Z 2025-05-07T20:31:57.6994459Z moe/activation_test.py:117: 2025-05-07T20:31:57.6994592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6994697Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.6994797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.6995313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.6995421Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.6995876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.6996105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.6996466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.6996560Z kernel = self.compile( 2025-05-07T20:31:57.6996961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.6997138Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.6997271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.6997275Z 2025-05-07T20:31:57.6997488Z self = 2025-05-07T20:31:57.6998300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.6998823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474e5ee80>} 2025-05-07T20:31:57.6999596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.6999793Z context = 2025-05-07T20:31:57.6999798Z 2025-05-07T20:31:57.6999966Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.7000238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7000427Z module_map=module_map) 2025-05-07T20:31:57.7000592Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7000694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7000780Z E ^ 2025-05-07T20:31:57.7001145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7001149Z 2025-05-07T20:31:57.7001586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7001590Z 2025-05-07T20:31:57.7001694Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7001923Z self=, 2025-05-07T20:31:57.7002006Z T=2048, 2025-05-07T20:31:57.7002087Z D=7168, 2025-05-07T20:31:57.7002170Z scale_ub=1200.0, 2025-05-07T20:31:57.7002265Z contiguous=True, 2025-05-07T20:31:57.7002350Z compiled=False, 2025-05-07T20:31:57.7002430Z ) 2025-05-07T20:31:57.7002654Z self = 2025-05-07T20:31:57.7002839Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7002843Z 2025-05-07T20:31:57.7002931Z @given( 2025-05-07T20:31:57.7003051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7003152Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7003275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7003392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7003508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7003591Z ) 2025-05-07T20:31:57.7003843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7003946Z def test_silu_mul_quant( 2025-05-07T20:31:57.7004027Z self, 2025-05-07T20:31:57.7004105Z T: int, 2025-05-07T20:31:57.7004190Z D: int, 2025-05-07T20:31:57.7004288Z scale_ub: Optional[float], 2025-05-07T20:31:57.7004379Z contiguous: bool, 2025-05-07T20:31:57.7004558Z compiled: bool, 2025-05-07T20:31:57.7004638Z ) -> None: 2025-05-07T20:31:57.7004734Z torch.manual_seed(2025) 2025-05-07T20:31:57.7004813Z 2025-05-07T20:31:57.7004982Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7007229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7007250Z 2025-05-07T20:31:57.7007378Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7007383Z 2025-05-07T20:31:57.7007501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7007729Z self=, 2025-05-07T20:31:57.7007806Z T=1, 2025-05-07T20:31:57.7007890Z D=5120, 2025-05-07T20:31:57.7007974Z scale_ub=1200.0, 2025-05-07T20:31:57.7008059Z contiguous=True, 2025-05-07T20:31:57.7008148Z compiled=False, 2025-05-07T20:31:57.7008222Z ) 2025-05-07T20:31:57.7008444Z self = 2025-05-07T20:31:57.7008623Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7008627Z 2025-05-07T20:31:57.7008703Z @given( 2025-05-07T20:31:57.7008831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7009149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7009265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7009387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7009505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7009581Z ) 2025-05-07T20:31:57.7009838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7009930Z def test_silu_mul_quant( 2025-05-07T20:31:57.7010005Z self, 2025-05-07T20:31:57.7010087Z T: int, 2025-05-07T20:31:57.7010162Z D: int, 2025-05-07T20:31:57.7010260Z scale_ub: Optional[float], 2025-05-07T20:31:57.7010354Z contiguous: bool, 2025-05-07T20:31:57.7010439Z compiled: bool, 2025-05-07T20:31:57.7010523Z ) -> None: 2025-05-07T20:31:57.7010618Z torch.manual_seed(2025) 2025-05-07T20:31:57.7010691Z 2025-05-07T20:31:57.7010866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7010946Z 2025-05-07T20:31:57.7011038Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7011167Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7011260Z x = x_sign * x_clamp 2025-05-07T20:31:57.7011337Z x0 = x[:, :D] 2025-05-07T20:31:57.7011425Z x1 = x[:, D:] 2025-05-07T20:31:57.7011496Z 2025-05-07T20:31:57.7011579Z if contiguous: 2025-05-07T20:31:57.7011678Z x0 = x0.contiguous() 2025-05-07T20:31:57.7011843Z x1 = x1.contiguous() 2025-05-07T20:31:57.7011917Z 2025-05-07T20:31:57.7012017Z if scale_ub is not None: 2025-05-07T20:31:57.7012146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.7012308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.7012388Z ) 2025-05-07T20:31:57.7012469Z else: 2025-05-07T20:31:57.7012572Z scale_ub_tensor = None 2025-05-07T20:31:57.7012650Z 2025-05-07T20:31:57.7012780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.7012875Z op = silu_mul_quant 2025-05-07T20:31:57.7012959Z if compiled: 2025-05-07T20:31:57.7013195Z op = torch.compile(op) 2025-05-07T20:31:57.7013309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7013382Z 2025-05-07T20:31:57.7013471Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.7013482Z 2025-05-07T20:31:57.7013580Z moe/activation_test.py:117: 2025-05-07T20:31:57.7013712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7013820Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.7013919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7014440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.7014542Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.7014919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.7015154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.7015512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.7015606Z kernel = self.compile( 2025-05-07T20:31:57.7016008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.7016186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.7016315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7016320Z 2025-05-07T20:31:57.7016536Z self = 2025-05-07T20:31:57.7017345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.7018000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474a1c400>} 2025-05-07T20:31:57.7018773Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.7018973Z context = 2025-05-07T20:31:57.7018977Z 2025-05-07T20:31:57.7019147Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.7019420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7026040Z module_map=module_map) 2025-05-07T20:31:57.7026246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7026349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7026427Z E ^ 2025-05-07T20:31:57.7026809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7026815Z 2025-05-07T20:31:57.7027252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7027256Z 2025-05-07T20:31:57.7027371Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7027600Z self=, 2025-05-07T20:31:57.7027678Z T=2048, 2025-05-07T20:31:57.7027766Z D=5120, 2025-05-07T20:31:57.7027849Z scale_ub=None, 2025-05-07T20:31:57.7027935Z contiguous=True, 2025-05-07T20:31:57.7028029Z compiled=False, 2025-05-07T20:31:57.7028106Z ) 2025-05-07T20:31:57.7028331Z self = 2025-05-07T20:31:57.7028521Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7028525Z 2025-05-07T20:31:57.7028720Z @given( 2025-05-07T20:31:57.7028860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7028962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7029080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7029210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7029326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7029402Z ) 2025-05-07T20:31:57.7029666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7029761Z def test_silu_mul_quant( 2025-05-07T20:31:57.7029839Z self, 2025-05-07T20:31:57.7029925Z T: int, 2025-05-07T20:31:57.7030002Z D: int, 2025-05-07T20:31:57.7030114Z scale_ub: Optional[float], 2025-05-07T20:31:57.7030204Z contiguous: bool, 2025-05-07T20:31:57.7030290Z compiled: bool, 2025-05-07T20:31:57.7030380Z ) -> None: 2025-05-07T20:31:57.7030482Z torch.manual_seed(2025) 2025-05-07T20:31:57.7030559Z 2025-05-07T20:31:57.7030739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7030817Z 2025-05-07T20:31:57.7030914Z > x_sign = torch.sign(x) 2025-05-07T20:31:57.7032786Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7032877Z 2025-05-07T20:31:57.7033001Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:57.7033005Z 2025-05-07T20:31:57.7033121Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7033348Z self=, 2025-05-07T20:31:57.7033435Z T=16384, 2025-05-07T20:31:57.7033511Z D=5120, 2025-05-07T20:31:57.7033594Z scale_ub=None, 2025-05-07T20:31:57.7033687Z contiguous=True, 2025-05-07T20:31:57.7033771Z compiled=False, 2025-05-07T20:31:57.7033843Z ) 2025-05-07T20:31:57.7034073Z self = 2025-05-07T20:31:57.7034253Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7034257Z 2025-05-07T20:31:57.7034342Z @given( 2025-05-07T20:31:57.7034462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7034567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7034689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7034808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7034928Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7035010Z ) 2025-05-07T20:31:57.7035262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7035363Z def test_silu_mul_quant( 2025-05-07T20:31:57.7035441Z self, 2025-05-07T20:31:57.7035518Z T: int, 2025-05-07T20:31:57.7035602Z D: int, 2025-05-07T20:31:57.7035700Z scale_ub: Optional[float], 2025-05-07T20:31:57.7035790Z contiguous: bool, 2025-05-07T20:31:57.7035884Z compiled: bool, 2025-05-07T20:31:57.7035964Z ) -> None: 2025-05-07T20:31:57.7036059Z torch.manual_seed(2025) 2025-05-07T20:31:57.7036139Z 2025-05-07T20:31:57.7036310Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7038261Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7038268Z 2025-05-07T20:31:57.7038388Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7038393Z 2025-05-07T20:31:57.7038496Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7038730Z self=, 2025-05-07T20:31:57.7038808Z T=4096, 2025-05-07T20:31:57.7038890Z D=5120, 2025-05-07T20:31:57.7038978Z scale_ub=None, 2025-05-07T20:31:57.7039061Z contiguous=True, 2025-05-07T20:31:57.7039152Z compiled=False, 2025-05-07T20:31:57.7039225Z ) 2025-05-07T20:31:57.7039450Z self = 2025-05-07T20:31:57.7039632Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7039637Z 2025-05-07T20:31:57.7039713Z @given( 2025-05-07T20:31:57.7039832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7039936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7040049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7040171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7040285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7040357Z ) 2025-05-07T20:31:57.7040618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7040711Z def test_silu_mul_quant( 2025-05-07T20:31:57.7040865Z self, 2025-05-07T20:31:57.7040947Z T: int, 2025-05-07T20:31:57.7041022Z D: int, 2025-05-07T20:31:57.7041118Z scale_ub: Optional[float], 2025-05-07T20:31:57.7041217Z contiguous: bool, 2025-05-07T20:31:57.7041302Z compiled: bool, 2025-05-07T20:31:57.7041380Z ) -> None: 2025-05-07T20:31:57.7041480Z torch.manual_seed(2025) 2025-05-07T20:31:57.7041551Z 2025-05-07T20:31:57.7041727Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7043929Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7043945Z 2025-05-07T20:31:57.7044072Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7044082Z 2025-05-07T20:31:57.7044183Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7044410Z self=, 2025-05-07T20:31:57.7044497Z T=2048, 2025-05-07T20:31:57.7044573Z D=5120, 2025-05-07T20:31:57.7044656Z scale_ub=None, 2025-05-07T20:31:57.7044748Z contiguous=False, 2025-05-07T20:31:57.7044831Z compiled=False, 2025-05-07T20:31:57.7044904Z ) 2025-05-07T20:31:57.7045130Z self = 2025-05-07T20:31:57.7045307Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.7045312Z 2025-05-07T20:31:57.7045395Z @given( 2025-05-07T20:31:57.7045517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7045617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7045736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7045937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7046054Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7046133Z ) 2025-05-07T20:31:57.7046385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7046485Z def test_silu_mul_quant( 2025-05-07T20:31:57.7046561Z self, 2025-05-07T20:31:57.7046638Z T: int, 2025-05-07T20:31:57.7046720Z D: int, 2025-05-07T20:31:57.7046816Z scale_ub: Optional[float], 2025-05-07T20:31:57.7046903Z contiguous: bool, 2025-05-07T20:31:57.7046995Z compiled: bool, 2025-05-07T20:31:57.7047072Z ) -> None: 2025-05-07T20:31:57.7047166Z torch.manual_seed(2025) 2025-05-07T20:31:57.7047243Z 2025-05-07T20:31:57.7047416Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7049274Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7049280Z 2025-05-07T20:31:57.7049398Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7049403Z 2025-05-07T20:31:57.7049504Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7049736Z self=, 2025-05-07T20:31:57.7049890Z T=4096, 2025-05-07T20:31:57.7049974Z D=7168, 2025-05-07T20:31:57.7050058Z scale_ub=None, 2025-05-07T20:31:57.7050144Z contiguous=True, 2025-05-07T20:31:57.7050233Z compiled=True, 2025-05-07T20:31:57.7050307Z ) 2025-05-07T20:31:57.7050535Z self = 2025-05-07T20:31:57.7050712Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:57.7050716Z 2025-05-07T20:31:57.7050794Z @given( 2025-05-07T20:31:57.7050914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7051019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7051133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7051255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7051370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7051444Z ) 2025-05-07T20:31:57.7051703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7051880Z def test_silu_mul_quant( 2025-05-07T20:31:57.7051957Z self, 2025-05-07T20:31:57.7052040Z T: int, 2025-05-07T20:31:57.7052116Z D: int, 2025-05-07T20:31:57.7052219Z scale_ub: Optional[float], 2025-05-07T20:31:57.7052318Z contiguous: bool, 2025-05-07T20:31:57.7052422Z compiled: bool, 2025-05-07T20:31:57.7052507Z ) -> None: 2025-05-07T20:31:57.7052632Z torch.manual_seed(2025) 2025-05-07T20:31:57.7052707Z 2025-05-07T20:31:57.7052882Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7054742Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7054752Z 2025-05-07T20:31:57.7054958Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7054963Z 2025-05-07T20:31:57.7055066Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7055296Z self=, 2025-05-07T20:31:57.7055385Z T=2048, 2025-05-07T20:31:57.7055462Z D=5120, 2025-05-07T20:31:57.7055545Z scale_ub=1200.0, 2025-05-07T20:31:57.7055638Z contiguous=False, 2025-05-07T20:31:57.7055723Z compiled=False, 2025-05-07T20:31:57.7055796Z ) 2025-05-07T20:31:57.7056022Z self = 2025-05-07T20:31:57.7056201Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.7056209Z 2025-05-07T20:31:57.7056290Z @given( 2025-05-07T20:31:57.7056409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7056507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7056631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7056749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7056861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7056940Z ) 2025-05-07T20:31:57.7057195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7057295Z def test_silu_mul_quant( 2025-05-07T20:31:57.7057371Z self, 2025-05-07T20:31:57.7057448Z T: int, 2025-05-07T20:31:57.7057528Z D: int, 2025-05-07T20:31:57.7057625Z scale_ub: Optional[float], 2025-05-07T20:31:57.7057714Z contiguous: bool, 2025-05-07T20:31:57.7057805Z compiled: bool, 2025-05-07T20:31:57.7057881Z ) -> None: 2025-05-07T20:31:57.7058081Z torch.manual_seed(2025) 2025-05-07T20:31:57.7058159Z 2025-05-07T20:31:57.7058328Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7060196Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7060202Z 2025-05-07T20:31:57.7060320Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7060325Z 2025-05-07T20:31:57.7060426Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7060664Z self=, 2025-05-07T20:31:57.7060746Z T=4096, 2025-05-07T20:31:57.7060830Z D=7168, 2025-05-07T20:31:57.7060913Z scale_ub=1200.0, 2025-05-07T20:31:57.7060999Z contiguous=True, 2025-05-07T20:31:57.7061096Z compiled=False, 2025-05-07T20:31:57.7061168Z ) 2025-05-07T20:31:57.7061390Z self = 2025-05-07T20:31:57.7061574Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7061579Z 2025-05-07T20:31:57.7061656Z @given( 2025-05-07T20:31:57.7061773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7061877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7061991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7062113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7062227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7062306Z ) 2025-05-07T20:31:57.7062564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7062658Z def test_silu_mul_quant( 2025-05-07T20:31:57.7062735Z self, 2025-05-07T20:31:57.7062901Z T: int, 2025-05-07T20:31:57.7062981Z D: int, 2025-05-07T20:31:57.7063077Z scale_ub: Optional[float], 2025-05-07T20:31:57.7063170Z contiguous: bool, 2025-05-07T20:31:57.7063257Z compiled: bool, 2025-05-07T20:31:57.7063335Z ) -> None: 2025-05-07T20:31:57.7063436Z torch.manual_seed(2025) 2025-05-07T20:31:57.7063510Z 2025-05-07T20:31:57.7063685Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7065542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7065553Z 2025-05-07T20:31:57.7065679Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7065683Z 2025-05-07T20:31:57.7065785Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7066011Z self=, 2025-05-07T20:31:57.7066099Z T=16384, 2025-05-07T20:31:57.7066176Z D=7168, 2025-05-07T20:31:57.7066259Z scale_ub=None, 2025-05-07T20:31:57.7066355Z contiguous=False, 2025-05-07T20:31:57.7066437Z compiled=True, 2025-05-07T20:31:57.7066509Z ) 2025-05-07T20:31:57.7066737Z self = 2025-05-07T20:31:57.7066915Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:57.7066998Z 2025-05-07T20:31:57.7067082Z @given( 2025-05-07T20:31:57.7067205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7067309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7067429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7067547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7067661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7067741Z ) 2025-05-07T20:31:57.7067991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7068090Z def test_silu_mul_quant( 2025-05-07T20:31:57.7068168Z self, 2025-05-07T20:31:57.7068245Z T: int, 2025-05-07T20:31:57.7068326Z D: int, 2025-05-07T20:31:57.7068424Z scale_ub: Optional[float], 2025-05-07T20:31:57.7068511Z contiguous: bool, 2025-05-07T20:31:57.7068602Z compiled: bool, 2025-05-07T20:31:57.7068685Z ) -> None: 2025-05-07T20:31:57.7068779Z torch.manual_seed(2025) 2025-05-07T20:31:57.7068858Z 2025-05-07T20:31:57.7069026Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7070888Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
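[Note] Every one of these messages ends with the allocator's standard suggestion to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That knob only helps when a large amount of memory is reserved by the caching allocator but unallocated (fragmentation); here only about 13.87 MiB is in that state, so it would likely not rescue these examples. For completeness, a minimal sketch of how the setting is applied, assuming it is put in place before CUDA is initialized:

    import os
    # Must be in the environment before the first CUDA allocation, e.g. via
    #   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    # in the job's shell, or at the very top of the test entry point:
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var is set
    x = torch.empty(1024, device="cuda")  # the allocator picks up the setting lazily here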
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7070894Z 2025-05-07T20:31:57.7071013Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7071017Z 2025-05-07T20:31:57.7071124Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7071358Z self=, 2025-05-07T20:31:57.7071434Z T=4096, 2025-05-07T20:31:57.7071517Z D=7168, 2025-05-07T20:31:57.7071679Z scale_ub=None, 2025-05-07T20:31:57.7071766Z contiguous=True, 2025-05-07T20:31:57.7071856Z compiled=False, 2025-05-07T20:31:57.7071928Z ) 2025-05-07T20:31:57.7072172Z self = 2025-05-07T20:31:57.7072377Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7072382Z 2025-05-07T20:31:57.7072459Z @given( 2025-05-07T20:31:57.7072576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7072680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7072793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7072917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7073034Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7073108Z ) 2025-05-07T20:31:57.7073364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7073463Z def test_silu_mul_quant( 2025-05-07T20:31:57.7073539Z self, 2025-05-07T20:31:57.7073625Z T: int, 2025-05-07T20:31:57.7073699Z D: int, 2025-05-07T20:31:57.7073795Z scale_ub: Optional[float], 2025-05-07T20:31:57.7073887Z contiguous: bool, 2025-05-07T20:31:57.7073972Z compiled: bool, 2025-05-07T20:31:57.7074049Z ) -> None: 2025-05-07T20:31:57.7074149Z torch.manual_seed(2025) 2025-05-07T20:31:57.7074225Z 2025-05-07T20:31:57.7074399Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7076256Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7076341Z 2025-05-07T20:31:57.7076465Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7076469Z 2025-05-07T20:31:57.7076570Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7076798Z self=, 2025-05-07T20:31:57.7076882Z T=16384, 2025-05-07T20:31:57.7076959Z D=7168, 2025-05-07T20:31:57.7077039Z scale_ub=None, 2025-05-07T20:31:57.7077129Z contiguous=True, 2025-05-07T20:31:57.7077213Z compiled=False, 2025-05-07T20:31:57.7077288Z ) 2025-05-07T20:31:57.7077516Z self = 2025-05-07T20:31:57.7077701Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:57.7077705Z 2025-05-07T20:31:57.7077790Z @given( 2025-05-07T20:31:57.7077913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7078011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7078130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7078246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7078359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7078439Z ) 2025-05-07T20:31:57.7078688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7078786Z def test_silu_mul_quant( 2025-05-07T20:31:57.7078862Z self, 2025-05-07T20:31:57.7078939Z T: int, 2025-05-07T20:31:57.7079021Z D: int, 2025-05-07T20:31:57.7079117Z scale_ub: Optional[float], 2025-05-07T20:31:57.7079208Z contiguous: bool, 2025-05-07T20:31:57.7079298Z compiled: bool, 2025-05-07T20:31:57.7079376Z ) -> None: 2025-05-07T20:31:57.7079471Z torch.manual_seed(2025) 2025-05-07T20:31:57.7079550Z 2025-05-07T20:31:57.7079796Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7081656Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7081662Z 2025-05-07T20:31:57.7081779Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7081788Z 2025-05-07T20:31:57.7081890Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7082122Z self=, 2025-05-07T20:31:57.7082205Z T=16384, 2025-05-07T20:31:57.7082287Z D=7168, 2025-05-07T20:31:57.7082372Z scale_ub=1200.0, 2025-05-07T20:31:57.7082472Z contiguous=True, 2025-05-07T20:31:57.7082573Z compiled=False, 2025-05-07T20:31:57.7082662Z ) 2025-05-07T20:31:57.7082892Z self = 2025-05-07T20:31:57.7083079Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7083083Z 2025-05-07T20:31:57.7083160Z @given( 2025-05-07T20:31:57.7083279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7083384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7083497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7083698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7083813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7083887Z ) 2025-05-07T20:31:57.7084151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7084244Z def test_silu_mul_quant( 2025-05-07T20:31:57.7084322Z self, 2025-05-07T20:31:57.7084406Z T: int, 2025-05-07T20:31:57.7084481Z D: int, 2025-05-07T20:31:57.7084578Z scale_ub: Optional[float], 2025-05-07T20:31:57.7084672Z contiguous: bool, 2025-05-07T20:31:57.7084759Z compiled: bool, 2025-05-07T20:31:57.7084837Z ) -> None: 2025-05-07T20:31:57.7084939Z torch.manual_seed(2025) 2025-05-07T20:31:57.7085011Z 2025-05-07T20:31:57.7085188Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7087048Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
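[Note] At this point several examples in a row have failed at their very first allocation while ~21.7 GiB stays allocated by PyTorch, which suggests tensors from earlier examples (or an earlier test) are still live. One plausible mitigation, not taken from this log, is to drop references and return cached blocks between Hypothesis examples, e.g. with a hypothetical helper like:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        # Hypothetical cleanup hook (not in activation_test.py): collect dead
        # Python references, then return cached blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()
        # Visibility into what is held by live tensors vs. the allocator cache:
        print(f"allocated={torch.cuda.memory_allocated() / 2**30:.2f} GiB "
              f"reserved={torch.cuda.memory_reserved() / 2**30:.2f} GiB")

Note that torch.cuda.empty_cache() only releases blocks the caching allocator holds but no tensor uses; the 21.73 GiB reported as "allocated by PyTorch" can only shrink once the tensors referencing it are released.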
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7087059Z 2025-05-07T20:31:57.7087182Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7087187Z 2025-05-07T20:31:57.7087289Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7087516Z self=, 2025-05-07T20:31:57.7087600Z T=128, 2025-05-07T20:31:57.7087677Z D=5120, 2025-05-07T20:31:57.7087761Z scale_ub=1200.0, 2025-05-07T20:31:57.7087853Z contiguous=False, 2025-05-07T20:31:57.7087938Z compiled=False, 2025-05-07T20:31:57.7088015Z ) 2025-05-07T20:31:57.7088243Z self = 2025-05-07T20:31:57.7088418Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:57.7088422Z 2025-05-07T20:31:57.7088606Z @given( 2025-05-07T20:31:57.7088727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7088824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7088946Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7089063Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7089179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7089258Z ) 2025-05-07T20:31:57.7089508Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7089620Z def test_silu_mul_quant( 2025-05-07T20:31:57.7089695Z self, 2025-05-07T20:31:57.7089772Z T: int, 2025-05-07T20:31:57.7089855Z D: int, 2025-05-07T20:31:57.7089957Z scale_ub: Optional[float], 2025-05-07T20:31:57.7090051Z contiguous: bool, 2025-05-07T20:31:57.7090136Z compiled: bool, 2025-05-07T20:31:57.7090212Z ) -> None: 2025-05-07T20:31:57.7090319Z torch.manual_seed(2025) 2025-05-07T20:31:57.7090390Z 2025-05-07T20:31:57.7090559Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7090640Z 2025-05-07T20:31:57.7090733Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7090858Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7090959Z x = x_sign * x_clamp 2025-05-07T20:31:57.7091040Z x0 = x[:, :D] 2025-05-07T20:31:57.7091120Z x1 = x[:, D:] 2025-05-07T20:31:57.7091200Z 2025-05-07T20:31:57.7091284Z if contiguous: 2025-05-07T20:31:57.7091383Z x0 = x0.contiguous() 2025-05-07T20:31:57.7091473Z x1 = x1.contiguous() 2025-05-07T20:31:57.7091545Z 2025-05-07T20:31:57.7091723Z if scale_ub is not None: 2025-05-07T20:31:57.7091893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.7092031Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.7092114Z ) 2025-05-07T20:31:57.7092196Z else: 2025-05-07T20:31:57.7092290Z scale_ub_tensor = None 2025-05-07T20:31:57.7092371Z 2025-05-07T20:31:57.7092502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.7092593Z op = silu_mul_quant 2025-05-07T20:31:57.7092684Z if compiled: 2025-05-07T20:31:57.7092782Z op = torch.compile(op) 2025-05-07T20:31:57.7092894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7092967Z 2025-05-07T20:31:57.7093057Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.7093061Z 2025-05-07T20:31:57.7093166Z moe/activation_test.py:117: 2025-05-07T20:31:57.7093298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7093404Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.7093512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7094038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.7094136Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.7094513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:57.7094741Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.7095098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.7095192Z kernel = self.compile(
2025-05-07T20:31:57.7095591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.7095776Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.7095911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.7095915Z
2025-05-07T20:31:57.7096217Z self =
2025-05-07T20:31:57.7097031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.7097550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4474ab2fc0>}
2025-05-07T20:31:57.7098336Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.7098533Z context =
2025-05-07T20:31:57.7098538Z
2025-05-07T20:31:57.7098712Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.7098986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.7099093Z module_map=module_map)
2025-05-07T20:31:57.7099263Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.7099362Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.7099446Z E ^
2025-05-07T20:31:57.7099814Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7099819Z 2025-05-07T20:31:57.7100243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7100248Z 2025-05-07T20:31:57.7100356Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7100663Z self=, 2025-05-07T20:31:57.7100745Z T=2048, 2025-05-07T20:31:57.7100821Z D=7168, 2025-05-07T20:31:57.7100904Z scale_ub=None, 2025-05-07T20:31:57.7100998Z contiguous=False, 2025-05-07T20:31:57.7101084Z compiled=False, 2025-05-07T20:31:57.7101158Z ) 2025-05-07T20:31:57.7101401Z self = 2025-05-07T20:31:57.7101609Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.7101615Z 2025-05-07T20:31:57.7101695Z @given( 2025-05-07T20:31:57.7101820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7101919Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7102041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7102161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7102274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7102358Z ) 2025-05-07T20:31:57.7102612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7102705Z def test_silu_mul_quant( 2025-05-07T20:31:57.7102791Z self, 2025-05-07T20:31:57.7102868Z T: int, 2025-05-07T20:31:57.7102944Z D: int, 2025-05-07T20:31:57.7103054Z scale_ub: Optional[float], 2025-05-07T20:31:57.7103142Z contiguous: bool, 2025-05-07T20:31:57.7103227Z compiled: bool, 2025-05-07T20:31:57.7103311Z ) -> None: 2025-05-07T20:31:57.7103406Z torch.manual_seed(2025) 2025-05-07T20:31:57.7103478Z 2025-05-07T20:31:57.7103655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7105591Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
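[Note] The triton.compiler.errors.CompilationError shown above is an architecture limit rather than a kernel bug: Triton's fp8e4nv type (FP8 E4M3) is only code-generated for NVIDIA compute capability 8.9 and newer, while the A10G GPUs on this linux.g5.4xlarge runner are compute capability 8.6, where Triton offers only the fp8e5 and fp8e4b15 variants named in the ValueError. A minimal sketch of a capability guard, assuming one wanted to skip rather than fail these cases (this guard is not in the test file):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv needs sm_89+ (Ada/Hopper); the A10G here is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires sm_89 or newer")
    class Fp8KernelTests(unittest.TestCase):
        ...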
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7105609Z 2025-05-07T20:31:57.7105732Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7105737Z 2025-05-07T20:31:57.7105839Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7106072Z self=, 2025-05-07T20:31:57.7106442Z T=128, 2025-05-07T20:31:57.7106558Z D=7168, 2025-05-07T20:31:57.7106693Z scale_ub=1200.0, 2025-05-07T20:31:57.7106794Z contiguous=True, 2025-05-07T20:31:57.7106877Z compiled=True, 2025-05-07T20:31:57.7106956Z ) 2025-05-07T20:31:57.7107181Z self = 2025-05-07T20:31:57.7107366Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.7107371Z 2025-05-07T20:31:57.7107448Z @given( 2025-05-07T20:31:57.7107569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7107680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7107796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7107916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7108037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7108111Z ) 2025-05-07T20:31:57.7108364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7108464Z def test_silu_mul_quant( 2025-05-07T20:31:57.7108540Z self, 2025-05-07T20:31:57.7108623Z T: int, 2025-05-07T20:31:57.7108700Z D: int, 2025-05-07T20:31:57.7108798Z scale_ub: Optional[float], 2025-05-07T20:31:57.7108895Z contiguous: bool, 2025-05-07T20:31:57.7109224Z compiled: bool, 2025-05-07T20:31:57.7109300Z ) -> None: 2025-05-07T20:31:57.7109399Z torch.manual_seed(2025) 2025-05-07T20:31:57.7109470Z 2025-05-07T20:31:57.7109643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7109727Z 2025-05-07T20:31:57.7109823Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7109949Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7110046Z x = x_sign * x_clamp 2025-05-07T20:31:57.7110125Z x0 = x[:, :D] 2025-05-07T20:31:57.7110207Z x1 = x[:, D:] 2025-05-07T20:31:57.7110285Z 2025-05-07T20:31:57.7110369Z if contiguous: 2025-05-07T20:31:57.7110468Z x0 = x0.contiguous() 2025-05-07T20:31:57.7110560Z x1 = x1.contiguous() 2025-05-07T20:31:57.7110632Z 2025-05-07T20:31:57.7110725Z if scale_ub is not None: 2025-05-07T20:31:57.7110830Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.7110975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.7111058Z ) 2025-05-07T20:31:57.7111133Z else: 2025-05-07T20:31:57.7111227Z scale_ub_tensor = None 2025-05-07T20:31:57.7111305Z 2025-05-07T20:31:57.7111441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.7111532Z op = silu_mul_quant 2025-05-07T20:31:57.7111622Z if compiled: 2025-05-07T20:31:57.7111721Z op = torch.compile(op) 2025-05-07T20:31:57.7111835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7111908Z 2025-05-07T20:31:57.7112020Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.7112026Z 2025-05-07T20:31:57.7112137Z moe/activation_test.py:117: 2025-05-07T20:31:57.7112286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7112387Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.7112493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.7112879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.7112972Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.7113620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.7113721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.7114098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.7114325Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.7114677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.7114776Z kernel = self.compile( 2025-05-07T20:31:57.7115171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.7115360Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.7115491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.7115496Z 2025-05-07T20:31:57.7115708Z self = 2025-05-07T20:31:57.7116523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.7117039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f44747c5120>} 2025-05-07T20:31:57.7117824Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.7118097Z context = 2025-05-07T20:31:57.7118102Z 2025-05-07T20:31:57.7118275Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.7118555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7118661Z module_map=module_map) 2025-05-07T20:31:57.7118830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7118928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7119005Z E ^ 2025-05-07T20:31:57.7119378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.7119383Z 2025-05-07T20:31:57.7119811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.7119822Z 2025-05-07T20:31:57.7119928Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7120159Z self=, 2025-05-07T20:31:57.7120236Z T=128, 2025-05-07T20:31:57.7120316Z D=7168, 2025-05-07T20:31:57.7120406Z scale_ub=1200.0, 2025-05-07T20:31:57.7120490Z contiguous=True, 2025-05-07T20:31:57.7120582Z compiled=False, 2025-05-07T20:31:57.7120654Z ) 2025-05-07T20:31:57.7120876Z self = 2025-05-07T20:31:57.7121056Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.7121060Z 2025-05-07T20:31:57.7121136Z @given( 2025-05-07T20:31:57.7121263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7121364Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7121479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7121601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7121724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7121803Z ) 2025-05-07T20:31:57.7122060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7122272Z def test_silu_mul_quant( 2025-05-07T20:31:57.7122364Z self, 2025-05-07T20:31:57.7122461Z T: int, 2025-05-07T20:31:57.7122550Z D: int, 2025-05-07T20:31:57.7122655Z scale_ub: Optional[float], 2025-05-07T20:31:57.7122746Z contiguous: bool, 2025-05-07T20:31:57.7122832Z compiled: bool, 2025-05-07T20:31:57.7122918Z ) -> None: 2025-05-07T20:31:57.7123013Z torch.manual_seed(2025) 2025-05-07T20:31:57.7123086Z 2025-05-07T20:31:57.7123262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7123337Z 2025-05-07T20:31:57.7123429Z x_sign = torch.sign(x) 2025-05-07T20:31:57.7123560Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.7125423Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7125429Z 2025-05-07T20:31:57.7125555Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:57.7125560Z 2025-05-07T20:31:57.7125663Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7125897Z self=, 2025-05-07T20:31:57.7125977Z T=128, 2025-05-07T20:31:57.7126054Z D=5120, 2025-05-07T20:31:57.7126143Z scale_ub=1200.0, 2025-05-07T20:31:57.7126312Z contiguous=True, 2025-05-07T20:31:57.7126397Z compiled=True, 2025-05-07T20:31:57.7126477Z ) 2025-05-07T20:31:57.7126701Z self = 2025-05-07T20:31:57.7126878Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.7126883Z 2025-05-07T20:31:57.7126968Z @given( 2025-05-07T20:31:57.7127088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7127187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7127306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7127423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7127542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7127615Z ) 2025-05-07T20:31:57.7127866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7127964Z def test_silu_mul_quant( 2025-05-07T20:31:57.7128046Z self, 2025-05-07T20:31:57.7128122Z T: int, 2025-05-07T20:31:57.7128202Z D: int, 2025-05-07T20:31:57.7128298Z scale_ub: Optional[float], 2025-05-07T20:31:57.7128388Z contiguous: bool, 2025-05-07T20:31:57.7128482Z compiled: bool, 2025-05-07T20:31:57.7128561Z ) -> None: 2025-05-07T20:31:57.7128655Z torch.manual_seed(2025) 2025-05-07T20:31:57.7128733Z 2025-05-07T20:31:57.7128899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7128978Z 2025-05-07T20:31:57.7129070Z > x_sign = torch.sign(x) 2025-05-07T20:31:57.7130911Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7130929Z 2025-05-07T20:31:57.7131126Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:57.7131131Z 2025-05-07T20:31:57.7131234Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.7131466Z self=, 2025-05-07T20:31:57.7131545Z T=128, 2025-05-07T20:31:57.7131620Z D=7168, 2025-05-07T20:31:57.7131708Z scale_ub=None, 2025-05-07T20:31:57.7131861Z contiguous=True, 2025-05-07T20:31:57.7131943Z compiled=True, 2025-05-07T20:31:57.7132021Z ) 2025-05-07T20:31:57.7132245Z self = 2025-05-07T20:31:57.7132421Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:57.7132425Z 2025-05-07T20:31:57.7132512Z @given( 2025-05-07T20:31:57.7132630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.7132733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.7132855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.7132973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.7133091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.7133168Z ) 2025-05-07T20:31:57.7133419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.7133520Z def test_silu_mul_quant( 2025-05-07T20:31:57.7133597Z self, 2025-05-07T20:31:57.7133679Z T: int, 2025-05-07T20:31:57.7133759Z D: int, 2025-05-07T20:31:57.7133856Z scale_ub: Optional[float], 2025-05-07T20:31:57.7133951Z contiguous: bool, 2025-05-07T20:31:57.7134037Z compiled: bool, 2025-05-07T20:31:57.7134115Z ) -> None: 2025-05-07T20:31:57.7134216Z torch.manual_seed(2025) 2025-05-07T20:31:57.7134369Z 2025-05-07T20:31:57.7134539Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.7136390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:57.7136396Z 2025-05-07T20:31:57.7136515Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:57.7136657Z =============================== warnings summary =============================== 2025-05-07T20:31:57.7136976Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:57.7137302Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:57.7137619Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:57.7138530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:57.7138775Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:57.7138779Z 2025-05-07T20:31:57.7138966Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:57.7140377Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:57.7140580Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:57.7140585Z 2025-05-07T20:31:57.7140808Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:57.7140979Z ================== 1 failed, 1 passed, 13 warnings in 20.16s =================== 2025-05-07T20:31:59.4284684Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:59.4905971Z 2025-05-07T20:31:59.4906689Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:59.4907187Z 2025-05-07T20:31:59.4907220Z 2025-05-07T20:31:59.4928872Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:01.6431735Z ============================= test session starts ============================== 2025-05-07T20:32:01.6432555Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:01.6433150Z cachedir: .pytest_cache 2025-05-07T20:32:01.6433753Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:01.6434494Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:01.6434914Z plugins: hypothesis-6.131.14 2025-05-07T20:32:03.2566738Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:03.3640196Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:03.3641079Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:03.3641300Z 2025-05-07T20:32:05.4355249Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4356397Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:05.4357789Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4359284Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4368308Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4369695Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4371143Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4372234Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4373505Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4375342Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4376457Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4377786Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4379085Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:05.4380357Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4381619Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:05.4382479Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4383543Z W0507 
20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:05.4384609Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:05.4385423Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:05.4386841Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4388170Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4389335Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.4390419Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:05.4391632Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4393054Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4394157Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4395098Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4395864Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:05.4396909Z W0507 20:32:05.433000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4514402Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4516216Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:05.4517610Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4519086Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4520093Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4521449Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4522878Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4523892Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4525209Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4526631Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4527868Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4529190Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4530475Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:05.4531730Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4533097Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:05.4533954Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4535014Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:05.4536075Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:05.4536897Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:05.4538150Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4539585Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4540746Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.4541969Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:05.4543198Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4544654Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4545768Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4546712Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4547475Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:05.4548534Z W0507 20:32:05.450000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8625723Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8626820Z self=, 2025-05-07T20:32:05.8627240Z T=1, 2025-05-07T20:32:05.8627426Z D=5120, 2025-05-07T20:32:05.8627613Z scale_ub=None, 2025-05-07T20:32:05.8627836Z contiguous=True, 2025-05-07T20:32:05.8628056Z compiled=True, 2025-05-07T20:32:05.8628270Z ) 2025-05-07T20:32:05.8628598Z self = 2025-05-07T20:32:05.8629098Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.8629370Z 2025-05-07T20:32:05.8629452Z @given( 2025-05-07T20:32:05.8629692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8630009Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8630328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8630664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8631000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8631292Z ) 2025-05-07T20:32:05.8631653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8632108Z def test_silu_mul_quant( 2025-05-07T20:32:05.8632350Z self, 2025-05-07T20:32:05.8632557Z T: int, 2025-05-07T20:32:05.8632765Z D: int, 2025-05-07T20:32:05.8632984Z scale_ub: Optional[float], 2025-05-07T20:32:05.8633262Z contiguous: bool, 2025-05-07T20:32:05.8633511Z compiled: bool, 2025-05-07T20:32:05.8633740Z ) -> None: 2025-05-07T20:32:05.8633962Z torch.manual_seed(2025) 2025-05-07T20:32:05.8634216Z 2025-05-07T20:32:05.8634531Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8634884Z 2025-05-07T20:32:05.8635085Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8635385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8635697Z x = x_sign * x_clamp 2025-05-07T20:32:05.8635943Z x0 = x[:, :D] 2025-05-07T20:32:05.8636173Z x1 = x[:, D:] 2025-05-07T20:32:05.8636380Z 2025-05-07T20:32:05.8636571Z if contiguous: 2025-05-07T20:32:05.8636810Z x0 = x0.contiguous() 2025-05-07T20:32:05.8637226Z x1 = x1.contiguous() 2025-05-07T20:32:05.8637472Z 2025-05-07T20:32:05.8637669Z if scale_ub is not None: 2025-05-07T20:32:05.8637942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8638288Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8638606Z ) 2025-05-07T20:32:05.8638798Z else: 2025-05-07T20:32:05.8639013Z scale_ub_tensor = None 2025-05-07T20:32:05.8639269Z 2025-05-07T20:32:05.8639499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8639821Z op = silu_mul_quant 2025-05-07T20:32:05.8640079Z if compiled: 2025-05-07T20:32:05.8640330Z op = torch.compile(op) 2025-05-07T20:32:05.8640625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8640912Z 2025-05-07T20:32:05.8641108Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.8641395Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.8641692Z 2025-05-07T20:32:05.8641939Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8642275Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.8642572Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.8642894Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.8643254Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.8643573Z 2025-05-07T20:32:05.8643779Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.8643977Z 2025-05-07T20:32:05.8644088Z moe/activation_test.py:126: 2025-05-07T20:32:05.8644439Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8644782Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.8645213Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.8646035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.8646817Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.8647387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8648093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8648798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.8649544Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.8650299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.8650961Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.8651581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.8652186Z fn() 2025-05-07T20:32:05.8652715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.8653309Z self.fn.run( 2025-05-07T20:32:05.8653793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8654339Z kernel = self.compile( 2025-05-07T20:32:05.8654888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8655562Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8655971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8656207Z 2025-05-07T20:32:05.8656429Z self = 2025-05-07T20:32:05.8657635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8659199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4960c20>} 2025-05-07T20:32:05.8660592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8661656Z context = 2025-05-07T20:32:05.8661957Z 2025-05-07T20:32:05.8662128Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8662676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8663158Z module_map=module_map) 2025-05-07T20:32:05.8663539Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8663907Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.8664176Z E ^ 2025-05-07T20:32:05.8664656Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.8665128Z 
2025-05-07T20:32:05.8665555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.8666085Z 
2025-05-07T20:32:05.8666196Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.8666616Z     self=,
2025-05-07T20:32:05.8667030Z     T=2048,
2025-05-07T20:32:05.8667224Z     D=5120,
2025-05-07T20:32:05.8667509Z     scale_ub=1200.0,
2025-05-07T20:32:05.8667738Z     contiguous=True,
2025-05-07T20:32:05.8667969Z     compiled=False,
2025-05-07T20:32:05.8668173Z )
2025-05-07T20:32:06.3050387Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:06.3051547Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last):
2025-05-07T20:32:06.3053022Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:06.3054612Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:06.3055641Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3057020Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:06.3058481Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.3059514Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3060807Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:06.3062633Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.3063765Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3065121Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:06.3066445Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     generator.visit(fn.parse())
2025-05-07T20:32:06.3067751Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:06.3069027Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ret = super().visit(node)
2025-05-07T20:32:06.3069901Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.3071006Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:06.3072085Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     return visitor(node)
2025-05-07T20:32:06.3072921Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]            ^^^^^^^^^^^^^
2025-05-07T20:32:06.3074369Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:06.3075730Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:06.3076916Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:06.3078018Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     self.visit(item)
2025-05-07T20:32:06.3079266Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:06.3080716Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:06.3081841Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.3082803Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.3083577Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^
2025-05-07T20:32:06.3084660Z W0507 20:32:06.301000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.8475628Z self = 
2025-05-07T20:32:06.8476563Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:06.8476858Z 
2025-05-07T20:32:06.8476938Z     @given(
2025-05-07T20:32:06.8477173Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:06.8477502Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:06.8477804Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:06.8478139Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:06.8478473Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:06.8478757Z     )
2025-05-07T20:32:06.8479113Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:06.8479571Z     def test_silu_mul_quant(
2025-05-07T20:32:06.8479812Z         self,
2025-05-07T20:32:06.8480012Z         T: int,
2025-05-07T20:32:06.8480214Z         D: int,
2025-05-07T20:32:06.8480431Z         scale_ub: Optional[float],
2025-05-07T20:32:06.8480710Z         contiguous: bool,
2025-05-07T20:32:06.8480961Z         compiled: bool,
2025-05-07T20:32:06.8481194Z     ) -> None:
2025-05-07T20:32:06.8481408Z         torch.manual_seed(2025)
2025-05-07T20:32:06.8481654Z 
2025-05-07T20:32:06.8481937Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:06.8482281Z 
2025-05-07T20:32:06.8482479Z         x_sign = torch.sign(x)
2025-05-07T20:32:06.8482775Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:06.8483087Z         x = x_sign * x_clamp
2025-05-07T20:32:06.8483335Z         x0 = x[:, :D]
2025-05-07T20:32:06.8483557Z         x1 = x[:, D:]
2025-05-07T20:32:06.8483763Z 
2025-05-07T20:32:06.8483956Z         if contiguous:
2025-05-07T20:32:06.8484197Z             x0 = x0.contiguous()
2025-05-07T20:32:06.8484455Z             x1 = x1.contiguous()
2025-05-07T20:32:06.8484706Z 
2025-05-07T20:32:06.8484905Z         if scale_ub is not None:
2025-05-07T20:32:06.8485178Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:06.8485526Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:06.8485842Z             )
2025-05-07T20:32:06.8486213Z         else:
2025-05-07T20:32:06.8486423Z             scale_ub_tensor = None
2025-05-07T20:32:06.8486684Z 
2025-05-07T20:32:06.8487081Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.8487402Z             op = silu_mul_quant
2025-05-07T20:32:06.8487658Z             if compiled:
2025-05-07T20:32:06.8487914Z                 op = torch.compile(op)
2025-05-07T20:32:06.8488211Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.8488491Z 
2025-05-07T20:32:06.8488689Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:06.8488858Z 
2025-05-07T20:32:06.8488959Z moe/activation_test.py:117: 
2025-05-07T20:32:06.8489263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8489603Z moe/activation_test.py:115: in fn
2025-05-07T20:32:06.8489885Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.8490611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:06.8491328Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:06.8491944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.8492645Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.8493333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.8493885Z     kernel = self.compile(
2025-05-07T20:32:06.8494446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.8495116Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.8495526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8495849Z 
2025-05-07T20:32:06.8496069Z self = 
2025-05-07T20:32:06.8497198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.8498640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4820180>}
2025-05-07T20:32:06.8500038Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:06.8501102Z context = 
2025-05-07T20:32:06.8501398Z 
2025-05-07T20:32:06.8501578Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.8502112Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.8502599Z                            module_map=module_map)
2025-05-07T20:32:06.8502975Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.8503340Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.8503602Z E       ^
2025-05-07T20:32:06.8504085Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.8504602Z 
2025-05-07T20:32:06.8505037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:06.8505568Z 
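Every example fails for the same reason: this job runs on a g5.4xlarge, whose NVIDIA A10G GPU is compute capability 8.6, and Triton only lowers the fp8e4nv type (torch.float8_e4m3fn) on SM 8.9+ parts such as L4 or H100; on SM 8.6 it offers only fp8e5 and fp8e4b15, which is exactly what the ValueError lists. A minimal skip-guard sketch, assuming a unittest-style test class like the one above (the helper and decorator names here are illustrative, not FBGEMM APIs):

    import unittest

    import torch


    def _cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) only compiles in Triton on SM 8.9+
        # (e.g. L4, L40S, H100); the A10G in this job is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical decorator; applied to test_silu_mul_quant this would skip
    # cleanly instead of erroring on every Hypothesis example.
    skip_unless_fp8e4nv = unittest.skipUnless(
        _cuda_supports_fp8e4nv(), "requires native fp8e4nv support (SM 8.9+)"
    )
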
2025-05-07T20:32:06.8505674Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:06.8506100Z     self=,
2025-05-07T20:32:06.8506677Z     T=2048,
2025-05-07T20:32:06.8506872Z     D=5120,
2025-05-07T20:32:06.8507067Z     scale_ub=1200.0,
2025-05-07T20:32:06.8507295Z     contiguous=True,
2025-05-07T20:32:06.8507522Z     compiled=True,
2025-05-07T20:32:06.8507730Z )
2025-05-07T20:32:06.8508190Z self = 
2025-05-07T20:32:06.8508710Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:06.8508991Z 
2025-05-07T20:32:06.8519035Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.8519349Z             op = silu_mul_quant
2025-05-07T20:32:06.8519607Z             if compiled:
2025-05-07T20:32:06.8519860Z                 op = torch.compile(op)
2025-05-07T20:32:06.8520155Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.8520437Z 
2025-05-07T20:32:06.8520635Z         y_fp8, y_scale = fn()
2025-05-07T20:32:06.8520935Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:06.8521229Z 
2025-05-07T20:32:06.8521473Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.8521820Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:06.8522115Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:06.8522437Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:06.8522803Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.8523116Z 
2025-05-07T20:32:06.8523329Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:06.8523527Z 
2025-05-07T20:32:06.8523635Z moe/activation_test.py:126: 
2025-05-07T20:32:06.8523935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8524277Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:06.8524626Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.8525503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:06.8533999Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:06.8534763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.8535479Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.8536190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:06.8536938Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:06.8537696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:06.8538360Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:06.8538977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:06.8539519Z     fn()
2025-05-07T20:32:06.8540053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:06.8540661Z     self.fn.run(
2025-05-07T20:32:06.8541139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.8541692Z     kernel = self.compile(
2025-05-07T20:32:06.8542260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.8542935Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.8543350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:06.8543588Z 
2025-05-07T20:32:06.8543810Z self = 
2025-05-07T20:32:06.8545029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.8546447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d45eaa20>}
2025-05-07T20:32:06.8547849Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:06.8548921Z context = 
2025-05-07T20:32:06.8549219Z 
2025-05-07T20:32:06.8549398Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.8549933Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.8550417Z                            module_map=module_map)
2025-05-07T20:32:06.8550795Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.8551166Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:06.8551432Z E       ^
2025-05-07T20:32:06.8551910Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.8552381Z 
2025-05-07T20:32:06.8552809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:06.8553338Z 
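Note how compiled=True only shifts the failure site: the torch.compile path surfaces the problem as identify_mutated_tensors warnings, and the first hard error comes from the eager reference path, where triton_quantize_fp8_row compiles _kernel_quantize_fp8_row and hits the same ValueError. For context, a plain-PyTorch sketch of the row-wise fp8 quantization the reference path performs; this is an assumption-based stand-in for triton_quantize_fp8_row, whose exact scale_ub/epsilon handling may differ:

    from typing import Optional, Tuple

    import torch


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1 in fp32, then row-wise quantization to fp8e4m3:
        # each row is scaled so its max magnitude maps onto the fp8 range.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap per-row range
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = (row_max / fp8_max).clamp(min=1e-12)  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The test's accuracy check then compares y_fp8.to(torch.float32) * y_scale[:, None] against the unquantized fp32 result, which is why both the kernel under test and the reference need a working fp8e4nv target.
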
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.0969320Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.0970875Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.0972113Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.0973448Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.0974748Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:07.0976014Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.0977282Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:07.0978133Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.0979192Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:07.0980249Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:07.0981070Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:07.0982323Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.0983734Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.0984894Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.0985977Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:07.0987203Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.0988612Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.0989718Z 
W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.0990662Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.0991429Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:07.0992479Z W0507 20:32:07.092000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.1578194Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.1579467Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:07.1580852Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.1582319Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.1583328Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1584683Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.1586124Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1587140Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1588411Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.1589849Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1591377Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1592709Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.1594000Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:07.1595309Z W0507 20:32:07.154000 97296 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.1596566Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:07.1597416Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:07.1598468Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:07.1599521Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:07.1600337Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:07.1601579Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.1603028Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.1604182Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.1605310Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:07.1606693Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.1608096Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.1609207Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1610146Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1610911Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:07.1612013Z W0507 20:32:07.154000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.6565750Z self = 2025-05-07T20:32:07.6566341Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:07.6566646Z 2025-05-07T20:32:07.6566726Z @given( 2025-05-07T20:32:07.6566964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.6567279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.6567748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.6568089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.6568420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.6568708Z ) 2025-05-07T20:32:07.6569059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.6569515Z def test_silu_mul_quant( 2025-05-07T20:32:07.6569764Z self, 2025-05-07T20:32:07.6569957Z T: int, 2025-05-07T20:32:07.6570158Z D: int, 2025-05-07T20:32:07.6570380Z scale_ub: Optional[float], 2025-05-07T20:32:07.6570649Z contiguous: bool, 2025-05-07T20:32:07.6570890Z compiled: bool, 2025-05-07T20:32:07.6571119Z ) -> None: 2025-05-07T20:32:07.6571342Z torch.manual_seed(2025) 2025-05-07T20:32:07.6571586Z 2025-05-07T20:32:07.6571923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.6572268Z 2025-05-07T20:32:07.6572472Z x_sign = torch.sign(x) 2025-05-07T20:32:07.6572770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.6573080Z x = x_sign * x_clamp 2025-05-07T20:32:07.6573326Z x0 = x[:, :D] 2025-05-07T20:32:07.6573547Z x1 = x[:, D:] 2025-05-07T20:32:07.6573758Z 2025-05-07T20:32:07.6573942Z if contiguous: 2025-05-07T20:32:07.6574182Z x0 = x0.contiguous() 2025-05-07T20:32:07.6574446Z x1 = x1.contiguous() 2025-05-07T20:32:07.6574684Z 2025-05-07T20:32:07.6574877Z if scale_ub is not None: 2025-05-07T20:32:07.6575154Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.6575489Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.6575935Z ) 2025-05-07T20:32:07.6576133Z else: 2025-05-07T20:32:07.6576343Z scale_ub_tensor = None 2025-05-07T20:32:07.6576599Z 2025-05-07T20:32:07.6576835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.6577155Z op = silu_mul_quant 2025-05-07T20:32:07.6577413Z if compiled: 2025-05-07T20:32:07.6577666Z op = torch.compile(op) 2025-05-07T20:32:07.6577961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.6578243Z 2025-05-07T20:32:07.6578447Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.6578614Z 2025-05-07T20:32:07.6578725Z moe/activation_test.py:117: 2025-05-07T20:32:07.6579025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6579368Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.6579657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.6580364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.6581087Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.6581646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.6582353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.6583036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.6583590Z kernel = self.compile( 2025-05-07T20:32:07.6584151Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.6584830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.6585246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6585486Z 2025-05-07T20:32:07.6585696Z self = 2025-05-07T20:32:07.6586910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.6588331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cf06b6a0>} 2025-05-07T20:32:07.6589721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.6590783Z context = 2025-05-07T20:32:07.6591084Z 2025-05-07T20:32:07.6591260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.6591812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.6592291Z module_map=module_map) 2025-05-07T20:32:07.6592675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.6593040Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.6593302Z E ^ 2025-05-07T20:32:07.6593783Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.6594250Z 2025-05-07T20:32:07.6594688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.6595223Z 2025-05-07T20:32:07.6595334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.6595755Z self=, 2025-05-07T20:32:07.6596172Z T=1, 2025-05-07T20:32:07.6596361Z D=7168, 2025-05-07T20:32:07.6596557Z scale_ub=None, 2025-05-07T20:32:07.6596864Z contiguous=True, 2025-05-07T20:32:07.6597093Z compiled=True, 2025-05-07T20:32:07.6597301Z ) 2025-05-07T20:32:07.6597632Z self = 2025-05-07T20:32:07.6598143Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.6598410Z 2025-05-07T20:32:07.6598490Z @given( 2025-05-07T20:32:07.6598734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.6599061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.6599370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.6599706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.6600045Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.6600337Z ) 2025-05-07T20:32:07.6600692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.6601153Z def test_silu_mul_quant( 2025-05-07T20:32:07.6601411Z self, 2025-05-07T20:32:07.6601607Z T: int, 2025-05-07T20:32:07.6601806Z D: int, 2025-05-07T20:32:07.6602034Z scale_ub: Optional[float], 2025-05-07T20:32:07.6602309Z contiguous: bool, 2025-05-07T20:32:07.6602560Z compiled: bool, 2025-05-07T20:32:07.6602788Z ) -> None: 2025-05-07T20:32:07.6603004Z torch.manual_seed(2025) 2025-05-07T20:32:07.6603249Z 2025-05-07T20:32:07.6603528Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.6603872Z 2025-05-07T20:32:07.6604068Z x_sign = torch.sign(x) 2025-05-07T20:32:07.6604365Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.6604681Z x = x_sign * x_clamp 2025-05-07T20:32:07.6604920Z x0 = x[:, :D] 2025-05-07T20:32:07.6605139Z x1 = 
x[:, D:] 2025-05-07T20:32:07.6605351Z 2025-05-07T20:32:07.6605534Z if contiguous: 2025-05-07T20:32:07.6605768Z x0 = x0.contiguous() 2025-05-07T20:32:07.6606036Z x1 = x1.contiguous() 2025-05-07T20:32:07.6606430Z 2025-05-07T20:32:07.6606625Z if scale_ub is not None: 2025-05-07T20:32:07.6606903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.6607362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.6607680Z ) 2025-05-07T20:32:07.6607882Z else: 2025-05-07T20:32:07.6608090Z scale_ub_tensor = None 2025-05-07T20:32:07.6608343Z 2025-05-07T20:32:07.6608580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.6608899Z op = silu_mul_quant 2025-05-07T20:32:07.6609155Z if compiled: 2025-05-07T20:32:07.6609406Z op = torch.compile(op) 2025-05-07T20:32:07.6609708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.6609981Z 2025-05-07T20:32:07.6610177Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.6610470Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.6610769Z 2025-05-07T20:32:07.6611011Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.6611354Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.6611652Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.6612011Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.6612376Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.6612686Z 2025-05-07T20:32:07.6612893Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.6613089Z 2025-05-07T20:32:07.6613196Z moe/activation_test.py:126: 2025-05-07T20:32:07.6613502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6613840Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.6614178Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.6615020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.6615940Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.6616510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.6617217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.6617935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.6618676Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.6619434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.6620100Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.6620728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.6621269Z fn() 2025-05-07T20:32:07.6621794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.6622400Z self.fn.run( 2025-05-07T20:32:07.6622883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.6623433Z kernel = self.compile( 2025-05-07T20:32:07.6623990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.6624665Z module = src.make_ir(options, 
codegen_fns, module_map, context) 2025-05-07T20:32:07.6625067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.6625308Z 2025-05-07T20:32:07.6625519Z self = 2025-05-07T20:32:07.6626641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.6628230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec65620>} 2025-05-07T20:32:07.6629627Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.6630697Z context = 2025-05-07T20:32:07.6631001Z 2025-05-07T20:32:07.6631170Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.6631712Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.6632192Z module_map=module_map) 2025-05-07T20:32:07.6632569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.6632935Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.6633200Z E ^ 2025-05-07T20:32:07.6633685Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.6634160Z 2025-05-07T20:32:07.6634640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.6635174Z 2025-05-07T20:32:07.6635285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.6635706Z self=, 2025-05-07T20:32:07.6636121Z T=4096, 2025-05-07T20:32:07.6636315Z D=5120, 2025-05-07T20:32:07.6636503Z scale_ub=None, 2025-05-07T20:32:07.6636723Z contiguous=False, 2025-05-07T20:32:07.6636954Z compiled=False, 2025-05-07T20:32:07.6637170Z ) 2025-05-07T20:32:08.1114208Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.1115319Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:08.1116703Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.1118169Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.1119166Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1120526Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.1121963Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.1122980Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1124249Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.1125676Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.1126950Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1128286Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.1129590Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:08.1130864Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.1132176Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:08.1133046Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.1134112Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:08.1135223Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:08.1136041Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:08.1137303Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.1138779Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.1139943Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.1141028Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:08.1142252Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.1143666Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.1144824Z 
W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.1145778Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.1146548Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:08.1147603Z W0507 20:32:08.108000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3224890Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.3227088Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:08.3230127Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.3233053Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.3234798Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3236150Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.3237590Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3238602Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3239867Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.3241287Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3242504Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3243834Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.3245182Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:08.3246456Z W0507 20:32:08.319000 97296 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.3247712Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:08.3248572Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.3249642Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:08.3250703Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:08.3251528Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:08.3252819Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.3254151Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.3255390Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.3256468Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:08.3257679Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.3259076Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.3260174Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3261116Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.3261873Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:08.3262915Z W0507 20:32:08.319000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9081633Z self = 2025-05-07T20:32:08.9082743Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:08.9083321Z 2025-05-07T20:32:08.9083476Z @given( 2025-05-07T20:32:08.9083925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.9084869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.9085264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.9085625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.9085946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.9086229Z ) 2025-05-07T20:32:08.9086578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.9087025Z def test_silu_mul_quant( 2025-05-07T20:32:08.9087261Z self, 2025-05-07T20:32:08.9087455Z T: int, 2025-05-07T20:32:08.9087652Z D: int, 2025-05-07T20:32:08.9087860Z scale_ub: Optional[float], 2025-05-07T20:32:08.9088128Z contiguous: bool, 2025-05-07T20:32:08.9088370Z compiled: bool, 2025-05-07T20:32:08.9088597Z ) -> None: 2025-05-07T20:32:08.9095097Z torch.manual_seed(2025) 2025-05-07T20:32:08.9095352Z 2025-05-07T20:32:08.9095636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.9095991Z 2025-05-07T20:32:08.9096184Z x_sign = torch.sign(x) 2025-05-07T20:32:08.9096486Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.9096797Z x = x_sign * x_clamp 2025-05-07T20:32:08.9097044Z x0 = x[:, :D] 2025-05-07T20:32:08.9097267Z x1 = x[:, D:] 2025-05-07T20:32:08.9097480Z 2025-05-07T20:32:08.9097666Z if contiguous: 2025-05-07T20:32:08.9097901Z x0 = x0.contiguous() 2025-05-07T20:32:08.9098167Z x1 = x1.contiguous() 2025-05-07T20:32:08.9098407Z 2025-05-07T20:32:08.9098601Z if scale_ub is not None: 2025-05-07T20:32:08.9098878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.9099219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.9099545Z ) 2025-05-07T20:32:08.9099753Z else: 2025-05-07T20:32:08.9099966Z scale_ub_tensor = None 2025-05-07T20:32:08.9100232Z 2025-05-07T20:32:08.9100474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.9100791Z op = silu_mul_quant 2025-05-07T20:32:08.9101049Z if compiled: 2025-05-07T20:32:08.9101457Z op = torch.compile(op) 2025-05-07T20:32:08.9101773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9102048Z 2025-05-07T20:32:08.9102247Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.9102415Z 2025-05-07T20:32:08.9102525Z moe/activation_test.py:117: 2025-05-07T20:32:08.9102834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9103180Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.9103469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9104188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.9104936Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.9105511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.9106535Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.9107304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.9107858Z kernel = self.compile( 2025-05-07T20:32:08.9108417Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.9109092Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.9109499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.9109740Z 
2025-05-07T20:32:08.9109954Z self = 
2025-05-07T20:32:08.9111085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.9112663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec665c0>}
2025-05-07T20:32:08.9114050Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.9115113Z context = 
2025-05-07T20:32:08.9115417Z 
2025-05-07T20:32:08.9115588Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.9116132Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.9116616Z module_map=module_map)
2025-05-07T20:32:08.9116993Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.9117355Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.9117618Z E ^
2025-05-07T20:32:08.9118099Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9118570Z 
2025-05-07T20:32:08.9118999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.9119530Z 
2025-05-07T20:32:08.9119643Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
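Every failure in this job has the same root cause: Triton's NVIDIA backend only exposes the fp8e4nv (float8 e4m3) type on GPUs of compute capability 8.9 or newer, while this runner's g5.4xlarge carries an A10G (SM 8.6), where Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal guard along the following lines (a sketch only; the helper and class names are hypothetical, not part of the FBGEMM test file shown above) would skip these tests on such runners instead of failing them:

    # Sketch: skip FP8 tests where Triton's CUDA backend lacks fp8e4nv.
    import unittest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton enables fp8e4nv only on SM 8.9+ (e.g. L4, L40S, H100);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not fp8e4nv_supported(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantFP8Test(unittest.TestCase):  # hypothetical wrapper
        def test_placeholder(self) -> None:
            pass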
2025-05-07T20:32:08.9148723Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.9149087Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.9149349Z E ^
2025-05-07T20:32:08.9149827Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9150302Z 
2025-05-07T20:32:08.9150730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.9151266Z 
2025-05-07T20:32:08.9151372Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:08.9702829Z self = 
2025-05-07T20:32:08.9703908Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:08.9716340Z y_fp8, y_scale = fn()
2025-05-07T20:32:08.9716628Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:08.9716923Z 
2025-05-07T20:32:08.9717161Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.9717497Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:08.9717787Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:08.9718104Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:08.9718471Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.9718778Z 
2025-05-07T20:32:08.9718983Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:08.9719186Z 
2025-05-07T20:32:08.9719287Z moe/activation_test.py:126:
2025-05-07T20:32:08.9719591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.9719930Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:08.9720411Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.9721226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:08.9721998Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:08.9723966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:08.9724704Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:08.9725455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:08.9726119Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:08.9726743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:08.9727276Z fn()
2025-05-07T20:32:08.9727803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:08.9728405Z self.fn.run(
2025-05-07T20:32:08.9728879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.9729428Z kernel = self.compile(
2025-05-07T20:32:08.9729982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.9730652Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.9731057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.9740845Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.9741205Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:08.9741477Z E ^
2025-05-07T20:32:08.9741961Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9742428Z 
2025-05-07T20:32:08.9742871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
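Note that both code paths compile an FP8 Triton kernel on this architecture: the op under test (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row inside triton_quantize_fp8_row), so even examples that get past fn() fail in ref_fn(). The dtype restriction reproduces without FBGEMM at all; a standalone sketch (assuming a CUDA GPU older than SM 8.9, such as this runner's A10G):

    # Minimal repro sketch: any cast to tl.float8e4nv trips the same
    # ValueError during ast_to_ttir on pre-SM-8.9 GPUs.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError on SM < 8.9.
    _cast_to_fp8e4nv[(1,)](x, y, 128, BLOCK=128)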
2025-05-07T20:32:08.9743514Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:09.1721986Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.1722346Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.1722610Z E ^
2025-05-07T20:32:09.1723089Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.1723991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.1724633Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:09.1760382Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.1760829Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.1761093Z E ^
2025-05-07T20:32:09.1761578Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.1762483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
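Each "Trying example" line is Hypothesis (Verbosity.verbose) reporting the next drawn parameter combination; since the failure is architecture-dependent rather than input-dependent, every combination fails identically. To re-run one case from this log deterministically, it can be pinned with @example. A sketch, assuming the test body shown earlier in the log:

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        pass  # body as in the log above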
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1762051Z 2025-05-07T20:32:09.1762483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1763012Z 2025-05-07T20:32:09.1763122Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1763547Z self=, 2025-05-07T20:32:09.1763960Z T=1, 2025-05-07T20:32:09.1764147Z D=5120, 2025-05-07T20:32:09.1764345Z scale_ub=None, 2025-05-07T20:32:09.1764561Z contiguous=True, 2025-05-07T20:32:09.1764786Z compiled=True, 2025-05-07T20:32:09.1764995Z ) 2025-05-07T20:32:09.4060909Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.4062046Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:09.4063461Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.4064961Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.4065980Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4067346Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.4068980Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4070019Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4071295Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.4072740Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4073854Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4075192Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.4076494Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:09.4077772Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.4079031Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:09.4080011Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4081080Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.4082142Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:09.4082967Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.4084232Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.4085626Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.4086805Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.4087893Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:09.4089123Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.4090537Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.4091651Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4092784Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4093556Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:09.4094616Z W0507 20:32:09.403000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.4751453Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.4752553Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:09.4753950Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.4755481Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.4756500Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4757855Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.4759457Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4760481Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4761762Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.4763200Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4764306Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4765705Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.4767003Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:09.4768275Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.4769547Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:09.4770414Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.4771589Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.4772712Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:09.4773541Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.4774804Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.4776193Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.4777367Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.4778458Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:09.4779691Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.4781109Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.4782219Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4783279Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4784060Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:09.4785124Z W0507 20:32:09.472000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:09.7682244Z self = 
2025-05-07T20:32:09.7682755Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:09.7697363Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:09.7697795Z moe/activation_test.py:126:
2025-05-07T20:32:09.7698436Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:09.7698770Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.7699585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:09.7700366Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:09.7717155Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.7717517Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.7717784Z E ^
2025-05-07T20:32:09.7718263Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.7719158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7719927Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:09.9898020Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.9916136Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:09.9916988Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.9918053Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.9919237Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:09.9920069Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.9921320Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.9922656Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.9923820Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.9924913Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:09.9926196Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.9927611Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.9928724Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9929674Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9930447Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:09.9931908Z W0507 20:32:09.986000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.0583590Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.0584681Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:10.0586845Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.0589527Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.0591393Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0593852Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.0596031Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.0597055Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0598493Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.0599927Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.0601022Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0602352Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:10.0603656Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:10.0604929Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.0606319Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:10.0607179Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.0608242Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:10.0609297Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:10.0610125Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:10.0611501Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.0612874Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.0614035Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.0615116Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:10.0616353Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.0617763Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.0618871Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.0619816Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.0620583Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:10.0621641Z W0507 20:32:10.055000 97296 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:10.3519336Z self = 
2025-05-07T20:32:10.3519906Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:10.3534698Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:10.3535010Z moe/activation_test.py:126:
2025-05-07T20:32:10.3535662Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:10.3536008Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:10.3536821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:10.3537734Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:10.3554342Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.3554712Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:10.3554986Z E ^
2025-05-07T20:32:10.3555478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.3556448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.3557102Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.3557533Z self=,
2025-05-07T20:32:10.3557953Z T=128,
2025-05-07T20:32:10.3558154Z D=5120,
2025-05-07T20:32:10.3558348Z scale_ub=None,
2025-05-07T20:32:10.3558575Z contiguous=True,
2025-05-07T20:32:10.3558806Z compiled=True,
2025-05-07T20:32:10.3559011Z )
[elided: two "Encountered an exception in identify_mutated_tensors" warning tracebacks (frame [0/6]) and a test failure dump, identical apart from timestamps to the T=2048 output above, ending in the same CompilationError while compiling _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:10.9994891Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.9995327Z self=,
2025-05-07T20:32:10.9995748Z T=4096,
2025-05-07T20:32:10.9995936Z D=5120,
2025-05-07T20:32:10.9996131Z scale_ub=None,
2025-05-07T20:32:10.9996352Z contiguous=True,
2025-05-07T20:32:10.9996574Z compiled=True,
2025-05-07T20:32:10.9996778Z )
[elided: two identify_mutated_tensors warning tracebacks (frame [0/7]) and another identical failure dump for T=4096, ending in the same CompilationError]
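Every example tried so far dies in the same place: Triton refuses to emit fp8e4nv (torch.float8_e4m3fn) conversions on this runner's A10G, which is compute capability 8.6, while recent Triton CUDA backends require sm_89 or newer (Ada/Hopper) for that dtype. A minimal sketch of a hardware gate follows; supports_fp8_e4nv is a hypothetical helper, not FBGEMM's actual guard:

```python
# Hypothetical guard for FP8 E4M3 test cases; assumes Triton's fp8e4nv
# requirement of NVIDIA compute capability >= 8.9 (Ada/Hopper).
import torch


def supports_fp8_e4nv() -> bool:
    """True when the current CUDA device can compile fp8e4nv Triton kernels."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

A test class could then mark its FP8 cases with `@unittest.skipUnless(supports_fp8_e4nv(), "fp8e4nv requires sm_89+")` instead of letting every Hypothesis example fail at compile time.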
2025-05-07T20:32:11.6420240Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.6420666Z self=,
2025-05-07T20:32:11.6421081Z T=16384,
2025-05-07T20:32:11.6421282Z D=5120,
2025-05-07T20:32:11.6421485Z scale_ub=None,
2025-05-07T20:32:11.6421703Z contiguous=True,
2025-05-07T20:32:11.6421931Z compiled=True,
2025-05-07T20:32:11.6422142Z )
2025-05-07T20:32:11.6673961Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:11.6676139Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:11.6677524Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:11.6678659Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:11.6679806Z W0507 20:32:11.665000 97296 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[elided: failure dump for T=16384, identical apart from timestamps to the T=2048 dump above: ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError]
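Separately from the FP8 failures, the recompile warning above shows torch.compile abandoning silu_mul_quant after eight recompiles: each Hypothesis example changes T, and slicing x[:, :D] versus calling .contiguous() flips x0's leading stride between 10240 and 5120, so the guards keep missing. A sketch of the usual knobs, assuming a recent PyTorch 2.x where the limit is spelled recompile_limit (older releases call it cache_size_limit):

```python
import torch
import torch._dynamo

# Allow more specializations before torch.compile falls back to eager
# (the default is 8, per the warning above).
torch._dynamo.config.recompile_limit = 64

# Or mark the batch dimension dynamic so size changes reuse one graph;
# x0 here mimics the test's strided slice of a [T, 2*D] tensor.
x0 = torch.randn(16384, 10240)[:, :5120]
torch._dynamo.mark_dynamic(x0, 0)
```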
2025-05-07T20:32:11.7589170Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.7589594Z self=,
2025-05-07T20:32:11.7590061Z T=1,
2025-05-07T20:32:11.7590248Z D=5120,
2025-05-07T20:32:11.7590445Z scale_ub=1200.0,
2025-05-07T20:32:11.7590673Z contiguous=True,
2025-05-07T20:32:11.7590894Z compiled=True,
2025-05-07T20:32:11.7591103Z )
2025-05-07T20:32:11.8948486Z self =
2025-05-07T20:32:11.8949187Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[elided: Hypothesis/pytest listing of the test source, identical to the listing in the T=2048 dump above]
2025-05-07T20:32:11.8961152Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.8961423Z moe/activation_test.py:117:
2025-05-07T20:32:11.8961724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:11.8962066Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.8962358Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.8962928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:11.8963510Z     return fn(*args, **kwargs)
2025-05-07T20:32:11.8964188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.8964899Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:11.8965447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:11.8966223Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:11.8966907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:11.8967449Z     kernel = self.compile(
2025-05-07T20:32:11.8968010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:11.8968731Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:11.8969145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:11.8969591Z self =
2025-05-07T20:32:11.8970710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:11.8972181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96394cccc0>}
2025-05-07T20:32:11.8973570Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:11.8974627Z context =
2025-05-07T20:32:11.8975099Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:11.8975645Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:11.8976171Z                            module_map=module_map)
2025-05-07T20:32:11.8976540Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.8976903Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.8977169Z E       ^
2025-05-07T20:32:11.8977645Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.8978542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
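Note that this example fails inside fn() itself (the compiled silu_mul_quant path, kernel _fbgemm_silu_mul_quant) rather than in ref_fn. For orientation, the reference path is ordinary tensor math followed by rowwise FP8 quantization. A back-of-envelope sketch of what ref_fn's quantization step computes, assuming rowwise scaling into the float8_e4m3fn range (maximum normal value 448.0); the real triton_quantize_fp8_row in fbgemm_gpu may differ in details such as scale clamping and epsilon handling:

```python
from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Rowwise FP8 quantization sketch: y ~ y_fp8.float() * scale[:, None]."""
    row_max = y.abs().amax(dim=1).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / 448.0  # 448.0 = largest normal float8_e4m3fn value
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale
```

Under that convention the test's dequantization, y_fp8.to(torch.float32) * y_scale[:, None], recovers y up to FP8 rounding.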
2025-05-07T20:32:11.8979186Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.8979612Z     self=,
2025-05-07T20:32:11.8980105Z     T=1,
2025-05-07T20:32:11.8980291Z     D=5120,
2025-05-07T20:32:11.8980484Z     scale_ub=None,
2025-05-07T20:32:11.8980701Z     contiguous=False,
2025-05-07T20:32:11.8980924Z     compiled=True,
2025-05-07T20:32:11.8981128Z )
2025-05-07T20:32:11.9592115Z self = 
2025-05-07T20:32:11.9592646Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:11.9607459Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:11.9607665Z 
2025-05-07T20:32:11.9607767Z moe/activation_test.py:126: 
2025-05-07T20:32:11.9608076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9608414Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:11.9608751Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:11.9609721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:11.9610502Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:11.9611059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:11.9611815Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:11.9612533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:11.9613280Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:11.9614030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:11.9614687Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:11.9615314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:11.9615871Z     fn()
2025-05-07T20:32:11.9616415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:11.9617020Z     self.fn.run(
2025-05-07T20:32:11.9617500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:11.9618115Z     kernel = self.compile(
2025-05-07T20:32:11.9618670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:11.9619349Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:11.9619818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9627186Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9627552Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.9627822Z E       ^
2025-05-07T20:32:11.9628309Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.9628786Z 
2025-05-07T20:32:11.9629221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9629761Z 
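In this example the failure moves into the test's own oracle: ref_fn calls triton_quantize_fp8_row, which compiles a second Triton kernel (_kernel_quantize_fp8_row) needing the same fp8e4nv cast. A pure-eager stand-in is sketched below, inferred only from how the test dequantizes (y ~= y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub clamp is an assumption about the kernel's contract, not taken from fp8_gemm.py:

    import torch

    def quantize_fp8_row_eager(y, scale_ub=None):
        # Per-row symmetric quantization to e4m3; dequant is
        # y ~= y_fp8.float() * scale[:, None], matching the test's usage.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_amax = y.abs().amax(dim=-1).float()
        if scale_ub is not None:  # assumed clamp semantics
            row_amax = torch.minimum(row_amax, scale_ub.float())
        scale = torch.clamp(row_amax, min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale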
2025-05-07T20:32:11.9629864Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9630292Z     self=,
2025-05-07T20:32:11.9630700Z     T=1,
2025-05-07T20:32:11.9630884Z     D=5120,
2025-05-07T20:32:11.9631078Z     scale_ub=None,
2025-05-07T20:32:11.9631295Z     contiguous=True,
2025-05-07T20:32:11.9631601Z     compiled=False,
2025-05-07T20:32:11.9631814Z )
2025-05-07T20:32:12.1118457Z self = 
2025-05-07T20:32:12.1118997Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:12.1130909Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:12.1131077Z 
2025-05-07T20:32:12.1131179Z moe/activation_test.py:117: 
2025-05-07T20:32:12.1131482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.1131882Z moe/activation_test.py:115: in fn
2025-05-07T20:32:12.1132169Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:12.1132885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:12.1133592Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:12.1134149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:12.1134854Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:12.1135543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:12.1136093Z     kernel = self.compile(
2025-05-07T20:32:12.1136652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:12.1137452Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:12.1137863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.1145244Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.1145603Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.1145868Z E       ^
2025-05-07T20:32:12.1146350Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.1146858Z 
2025-05-07T20:32:12.1147289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.1147827Z 
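The Hypothesis loop adds a lot of noise around what is a deterministic compile-time failure. A standalone repro sketch, with the module path taken from the traceback above and the (x0, x1, scale_ub) call shape from the test body:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On a pre-sm_89 GPU this is expected to raise the same CompilationError.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)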
2025-05-07T20:32:12.1147938Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1148364Z     self=,
2025-05-07T20:32:12.1148780Z     T=128,
2025-05-07T20:32:12.1148967Z     D=5120,
2025-05-07T20:32:12.1149163Z     scale_ub=None,
2025-05-07T20:32:12.1149386Z     contiguous=False,
2025-05-07T20:32:12.1149616Z     compiled=True,
2025-05-07T20:32:12.1149830Z )
2025-05-07T20:32:12.1150159Z self = 
2025-05-07T20:32:12.1150662Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:12.1184401Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.1184767Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.1185030Z E       ^
2025-05-07T20:32:12.1185515Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.1186040Z 
2025-05-07T20:32:12.1186473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.1187012Z 
2025-05-07T20:32:12.1187122Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1187550Z     self=,
2025-05-07T20:32:12.1187965Z     T=128,
2025-05-07T20:32:12.1188163Z     D=7168,
2025-05-07T20:32:12.1188365Z     scale_ub=1200.0,
2025-05-07T20:32:12.1188596Z     contiguous=False,
2025-05-07T20:32:12.1188828Z     compiled=False,
2025-05-07T20:32:12.1189036Z )
2025-05-07T20:32:12.2324217Z self = 
2025-05-07T20:32:12.2324988Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:12.2353172Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.2353547Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.2353819Z E       ^
2025-05-07T20:32:12.2354300Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.2354777Z 
2025-05-07T20:32:12.2355209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.2355757Z 
2025-05-07T20:32:12.2355865Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.2356302Z     self=,
2025-05-07T20:32:12.2356717Z     T=128,
2025-05-07T20:32:12.2356916Z     D=5120,
2025-05-07T20:32:12.2357122Z     scale_ub=None,
2025-05-07T20:32:12.2357341Z     contiguous=False,
2025-05-07T20:32:12.2357579Z     compiled=False,
2025-05-07T20:32:12.2357802Z )
2025-05-07T20:32:12.2358127Z self = 
2025-05-07T20:32:12.2358674Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:12.2385586Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.2385962Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.2386227Z E       ^
2025-05-07T20:32:12.2386719Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.2387204Z 
2025-05-07T20:32:12.2387637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.2388172Z 
2025-05-07T20:32:12.2388287Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.2388749Z     self=,
2025-05-07T20:32:12.2389175Z     T=128,
2025-05-07T20:32:12.2389430Z     D=5120,
2025-05-07T20:32:12.2389629Z     scale_ub=1200.0,
2025-05-07T20:32:12.2389866Z     contiguous=True,
2025-05-07T20:32:12.2390095Z     compiled=False,
2025-05-07T20:32:12.2390312Z )
2025-05-07T20:32:12.4115396Z self = 
2025-05-07T20:32:12.4116526Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:12.4143943Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4144314Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4144596Z E       ^
2025-05-07T20:32:12.4145076Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4145551Z 
2025-05-07T20:32:12.4145985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4146533Z 
2025-05-07T20:32:12.4146644Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4147081Z     self=,
2025-05-07T20:32:12.4147500Z     T=1,
2025-05-07T20:32:12.4147706Z     D=7168,
2025-05-07T20:32:12.4147918Z     scale_ub=1200.0,
2025-05-07T20:32:12.4148150Z     contiguous=True,
2025-05-07T20:32:12.4148389Z     compiled=True,
2025-05-07T20:32:12.4148617Z )
2025-05-07T20:32:12.4149033Z self = 
2025-05-07T20:32:12.4149549Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:12.4177659Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4178031Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4178300Z E       ^
2025-05-07T20:32:12.4178790Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4179303Z 
2025-05-07T20:32:12.4179753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4180286Z 
2025-05-07T20:32:12.4180404Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4180834Z     self=,
2025-05-07T20:32:12.4181291Z     T=1,
2025-05-07T20:32:12.4181494Z     D=7168,
2025-05-07T20:32:12.4181698Z     scale_ub=1200.0,
2025-05-07T20:32:12.4181942Z     contiguous=False,
2025-05-07T20:32:12.4182189Z     compiled=True,
2025-05-07T20:32:12.4193565Z )
2025-05-07T20:32:12.5486234Z self = 
2025-05-07T20:32:12.5486990Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:12.5516048Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.5516416Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.5516696Z E       ^
2025-05-07T20:32:12.5517188Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.5517661Z 
2025-05-07T20:32:12.5518104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.5518639Z 
2025-05-07T20:32:12.5518750Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.5519190Z     self=,
2025-05-07T20:32:12.5519627Z     T=1,
2025-05-07T20:32:12.5519823Z     D=7168,
2025-05-07T20:32:12.5520034Z     scale_ub=None,
2025-05-07T20:32:12.5520270Z     contiguous=False,
2025-05-07T20:32:12.5520513Z     compiled=True,
2025-05-07T20:32:12.5520738Z )
2025-05-07T20:32:12.8174559Z self = 
2025-05-07T20:32:12.8175315Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:12.8190957Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:12.8191168Z 
2025-05-07T20:32:12.8191274Z moe/activation_test.py:126: 
2025-05-07T20:32:12.8191588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.8191945Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:12.8192285Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:12.8193118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:12.8193910Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:12.8211034Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8211535Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:12.8211885Z E       ^
2025-05-07T20:32:12.8212369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.8212845Z 
2025-05-07T20:32:12.8213282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.8213826Z 
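Both kernels fail for the same dtype reason, so a single capability check could turn this whole class of failures into skips rather than errors. A sketch follows; "ActivationTests" is a placeholder name, since this log strips the actual class repr:

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # Assumption: fp8e4nv lowering needs compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _has_fp8e4nv(), "Triton fp8e4nv assumed to require sm_89+")
    class ActivationTests(unittest.TestCase):  # placeholder class name
        pass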
2025-05-07T20:32:12.8213934Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.8214371Z     self=,
2025-05-07T20:32:12.8214788Z     T=1,
2025-05-07T20:32:12.8214981Z     D=5120,
2025-05-07T20:32:12.8215186Z     scale_ub=1200.0,
2025-05-07T20:32:12.8215424Z     contiguous=False,
2025-05-07T20:32:12.8215660Z     compiled=True,
2025-05-07T20:32:12.8215879Z )
2025-05-07T20:32:12.9730030Z self = 
2025-05-07T20:32:12.9730661Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:12.9758792Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.9759153Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.9759417Z E       ^
2025-05-07T20:32:12.9759888Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.9760359Z 
2025-05-07T20:32:12.9760786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.9761316Z 
2025-05-07T20:32:12.9761432Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.9761850Z     self=,
2025-05-07T20:32:12.9762263Z     T=1,
2025-05-07T20:32:12.9762447Z     D=5120,
2025-05-07T20:32:12.9762646Z     scale_ub=1200.0,
2025-05-07T20:32:12.9762869Z     contiguous=False,
2025-05-07T20:32:12.9763098Z     compiled=False,
2025-05-07T20:32:12.9763310Z )
2025-05-07T20:32:12.9763633Z self = 
2025-05-07T20:32:12.9764139Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:12.9790423Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.9790786Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.9791048Z E       ^
2025-05-07T20:32:12.9791522Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.9791991Z 
2025-05-07T20:32:12.9792420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.9792956Z 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.9792420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
(The following ten Hypothesis examples each failed at Triton compile time with the identical CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the repeated test source and tracebacks are omitted. The final example and its traceback are retained below.)
2025-05-07T20:32:12.9793059Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:13.0703193Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:13.1876843Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:13.1909511Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:13.1941170Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:13.3738165Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:13.5223686Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:13.5256577Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:13.6427701Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:13.6460592Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
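Every example above fails at the same point: Triton's make_ir rejects the fp8e4nv element type while lowering _fbgemm_silu_mul_quant. fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability (8, 9) (Ada/Hopper) onward; an older part such as an A10G reports (8, 6) and exposes only the 'fp8e4b15' and 'fp8e5' encodings named in the error, which matches this log. Below is a minimal sketch of a capability guard that a test like test_silu_mul_quant could use to skip cleanly on such GPUs; the helper _cuda_supports_fp8e4nv and the demo test class are hypothetical, not part of activation_test.py:

import unittest

import torch

def _cuda_supports_fp8e4nv() -> bool:
    # Hypothetical guard: Triton's fp8e4nv (float8_e4m3fn) kernels need an
    # NVIDIA GPU with compute capability >= (8, 9); older architectures fail
    # at kernel-compile time exactly as seen in the tracebacks above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardDemo(unittest.TestCase):
    @unittest.skipIf(
        not _cuda_supports_fp8e4nv(),
        "fp8e4nv requires SM 8.9+ (Ada/Hopper); skipping on this GPU",
    )
    def test_would_run_fp8_kernel(self) -> None:
        # Placeholder body; in activation_test.py this is where the
        # silu_mul_quant call that crashed above would execute.
        self.assertTrue(_cuda_supports_fp8e4nv())

if __name__ == "__main__":
    unittest.main()

Gating on torch.cuda.get_device_capability() rather than catching the Triton CompilationError keeps the skip decision cheap and local to the device, and Hypothesis would then report the case as skipped instead of spending max_examples on guaranteed compile failures.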
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.6502196Z 2025-05-07T20:32:13.6502630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.6503172Z 2025-05-07T20:32:13.6503279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.6503759Z self=, 2025-05-07T20:32:13.6504172Z T=16384, 2025-05-07T20:32:13.6504375Z D=7168, 2025-05-07T20:32:13.6504576Z scale_ub=1200.0, 2025-05-07T20:32:13.6504807Z contiguous=False, 2025-05-07T20:32:13.6505044Z compiled=True, 2025-05-07T20:32:13.8889238Z ) 2025-05-07T20:32:13.8889810Z self = 2025-05-07T20:32:13.8890373Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:13.8890675Z 2025-05-07T20:32:13.8890755Z @given( 2025-05-07T20:32:13.8890994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8891344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8891652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8892082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8892423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8892713Z ) 2025-05-07T20:32:13.8893081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8893536Z def test_silu_mul_quant( 2025-05-07T20:32:13.8893779Z self, 2025-05-07T20:32:13.8893993Z T: int, 2025-05-07T20:32:13.8894199Z D: int, 2025-05-07T20:32:13.8894417Z scale_ub: Optional[float], 2025-05-07T20:32:13.8894695Z contiguous: bool, 2025-05-07T20:32:13.8894952Z compiled: bool, 2025-05-07T20:32:13.8895185Z ) -> None: 2025-05-07T20:32:13.8895411Z torch.manual_seed(2025) 2025-05-07T20:32:13.8895662Z 2025-05-07T20:32:13.8895946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8896292Z 2025-05-07T20:32:13.8896494Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8896797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8897118Z x = x_sign * x_clamp 2025-05-07T20:32:13.8897368Z x0 = x[:, :D] 2025-05-07T20:32:13.8897593Z x1 = x[:, D:] 2025-05-07T20:32:13.8897803Z 2025-05-07T20:32:13.8897999Z if contiguous: 2025-05-07T20:32:13.8898239Z x0 = x0.contiguous() 2025-05-07T20:32:13.8898871Z x1 = x1.contiguous() 2025-05-07T20:32:13.8899126Z 2025-05-07T20:32:13.8899332Z if scale_ub is not None: 2025-05-07T20:32:13.8899607Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8899953Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8900274Z ) 2025-05-07T20:32:13.8900467Z else: 2025-05-07T20:32:13.8900692Z scale_ub_tensor = None 2025-05-07T20:32:13.8900957Z 2025-05-07T20:32:13.8901197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8901518Z op = silu_mul_quant 2025-05-07T20:32:13.8901779Z if compiled: 2025-05-07T20:32:13.8902035Z op = torch.compile(op) 2025-05-07T20:32:13.8902334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8902624Z 2025-05-07T20:32:13.8902830Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8902999Z 2025-05-07T20:32:13.8903101Z moe/activation_test.py:117: 2025-05-07T20:32:13.8903422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8903772Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8904058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8904638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.8905222Z return fn(*args, **kwargs) 
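Every example Hypothesis tries in this test fails at the same point: Triton rejects the _fbgemm_silu_mul_quant kernel at compile time because the fp8e4nv dtype (FP8 E4M3) is not implemented for the GPU this runner exposes, which only offers fp8e4b15 and fp8e5. In Triton's NVIDIA backend that check keys off the CUDA compute capability; fp8e4nv generally requires SM 8.9 or newer (Ada/Hopper), while Ampere-class parts such as the A10G report SM 8.6. The remaining examples, consolidated below, hit the identical error; the compiled=True cases merely add a torch/_dynamo/eval_frame.py frame before reaching the same kernel launch. A minimal capability probe, as a sketch only (plain PyTorch, not FBGEMM or test code; the SM 8.9 threshold is an assumption about this Triton build):

import torch

def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (FP8 E4M3) needs compute capability 8.9+;
    # on older NVIDIA parts Triton raises the CompilationError shown above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
print("fp8e4nv supported:", supports_fp8e4nv())

On a runner like this one, the probe would print a capability below (8, 9) and False, matching the compile-time rejection in the traceback above.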
2025-05-07T20:32:13.8905993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8907084Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8907642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8908453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8909141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8909692Z kernel = self.compile( 2025-05-07T20:32:13.8910251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8910933Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8911338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8911585Z 2025-05-07T20:32:13.8911798Z self = 2025-05-07T20:32:13.8912921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8914369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f963914fb00>} 2025-05-07T20:32:13.8915767Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8916824Z context = 2025-05-07T20:32:13.8917127Z 2025-05-07T20:32:13.8917301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8917875Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8918360Z module_map=module_map) 2025-05-07T20:32:13.8918737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8919098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8919370Z E ^ 2025-05-07T20:32:13.8919851Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8920442Z 2025-05-07T20:32:13.8920874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8921410Z 2025-05-07T20:32:13.8921518Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8921948Z self=, 2025-05-07T20:32:13.8922366Z T=1, 2025-05-07T20:32:13.8922557Z D=7168, 2025-05-07T20:32:13.8922759Z scale_ub=None, 2025-05-07T20:32:13.8922983Z contiguous=False, 2025-05-07T20:32:13.8923215Z compiled=False, 2025-05-07T20:32:13.8923429Z ) 2025-05-07T20:32:13.8923763Z self = 2025-05-07T20:32:13.8924269Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.8924547Z 2025-05-07T20:32:13.8924628Z @given( 2025-05-07T20:32:13.8924871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8925197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8925517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8925860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8926204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8926498Z ) 2025-05-07T20:32:13.8926864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8927393Z def test_silu_mul_quant( 2025-05-07T20:32:13.8927639Z self, 2025-05-07T20:32:13.8927846Z T: int, 2025-05-07T20:32:13.8928055Z D: int, 2025-05-07T20:32:13.8928277Z scale_ub: Optional[float], 2025-05-07T20:32:13.8928565Z contiguous: bool, 2025-05-07T20:32:13.8928868Z compiled: bool, 2025-05-07T20:32:13.8929096Z ) -> None: 2025-05-07T20:32:13.8929322Z torch.manual_seed(2025) 2025-05-07T20:32:13.8929575Z 2025-05-07T20:32:13.8929859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8930216Z 2025-05-07T20:32:13.8930417Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8930712Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8931032Z x = x_sign * x_clamp 2025-05-07T20:32:13.8931280Z x0 = x[:, :D] 2025-05-07T20:32:13.8931522Z x1 = x[:, D:] 2025-05-07T20:32:13.8931731Z 2025-05-07T20:32:13.8932019Z if contiguous: 2025-05-07T20:32:13.8932260Z x0 = x0.contiguous() 2025-05-07T20:32:13.8932523Z x1 = x1.contiguous() 2025-05-07T20:32:13.8932776Z 2025-05-07T20:32:13.8932976Z if scale_ub is not None: 2025-05-07T20:32:13.8933259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8933601Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8933927Z ) 2025-05-07T20:32:13.8934130Z else: 2025-05-07T20:32:13.8934344Z scale_ub_tensor = None 2025-05-07T20:32:13.8934606Z 2025-05-07T20:32:13.8934852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8935173Z op = silu_mul_quant 2025-05-07T20:32:13.8935434Z if compiled: 2025-05-07T20:32:13.8935690Z op = torch.compile(op) 2025-05-07T20:32:13.8935994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8936283Z 2025-05-07T20:32:13.8936482Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8936652Z 2025-05-07T20:32:13.8936755Z moe/activation_test.py:117: 2025-05-07T20:32:13.8937062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8937412Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8937702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8938408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8939121Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8939795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8940496Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8941187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8941740Z kernel = self.compile( 2025-05-07T20:32:13.8942303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8942981Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8943394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8943633Z 2025-05-07T20:32:13.8943852Z self = 2025-05-07T20:32:13.8944979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8946440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ce1200e0>} 2025-05-07T20:32:13.8947842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8948948Z context = 2025-05-07T20:32:13.8949247Z 2025-05-07T20:32:13.8949424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8950001Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8950486Z module_map=module_map) 2025-05-07T20:32:13.8950869Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8951237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8951501Z E ^ 2025-05-07T20:32:13.8951984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8952449Z 2025-05-07T20:32:13.8952890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8953425Z 2025-05-07T20:32:13.8953539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8953963Z self=, 2025-05-07T20:32:13.8954385Z T=2048, 2025-05-07T20:32:13.8954586Z D=7168, 2025-05-07T20:32:13.8954779Z scale_ub=None, 2025-05-07T20:32:13.8955007Z contiguous=False, 2025-05-07T20:32:13.8955243Z compiled=True, 2025-05-07T20:32:13.8955447Z ) 2025-05-07T20:32:13.9822335Z self = 2025-05-07T20:32:13.9823123Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.9823411Z 2025-05-07T20:32:13.9823491Z @given( 2025-05-07T20:32:13.9823732Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.9824054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.9824366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.9824708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.9825044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.9825338Z ) 2025-05-07T20:32:13.9825692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.9826157Z def test_silu_mul_quant( 2025-05-07T20:32:13.9826408Z self, 2025-05-07T20:32:13.9826605Z T: int, 2025-05-07T20:32:13.9826810Z D: int, 2025-05-07T20:32:13.9827038Z scale_ub: Optional[float], 2025-05-07T20:32:13.9827617Z contiguous: bool, 2025-05-07T20:32:13.9827872Z compiled: bool, 2025-05-07T20:32:13.9828109Z ) -> None: 2025-05-07T20:32:13.9828328Z torch.manual_seed(2025) 2025-05-07T20:32:13.9828580Z 2025-05-07T20:32:13.9828864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9829214Z 2025-05-07T20:32:13.9829421Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9829725Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9830046Z x = x_sign * x_clamp 2025-05-07T20:32:13.9830299Z x0 = x[:, :D] 2025-05-07T20:32:13.9830527Z x1 = x[:, D:] 2025-05-07T20:32:13.9830740Z 2025-05-07T20:32:13.9830934Z if contiguous: 2025-05-07T20:32:13.9831181Z x0 = x0.contiguous() 2025-05-07T20:32:13.9831457Z x1 = x1.contiguous() 2025-05-07T20:32:13.9831700Z 2025-05-07T20:32:13.9831898Z if scale_ub is not None: 2025-05-07T20:32:13.9832187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9832528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9832854Z ) 2025-05-07T20:32:13.9833058Z else: 2025-05-07T20:32:13.9833273Z scale_ub_tensor = None 2025-05-07T20:32:13.9833536Z 2025-05-07T20:32:13.9833779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9834184Z op = silu_mul_quant 2025-05-07T20:32:13.9834443Z if compiled: 2025-05-07T20:32:13.9834700Z op = torch.compile(op) 2025-05-07T20:32:13.9834996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9835281Z 2025-05-07T20:32:13.9835483Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.9835731Z 2025-05-07T20:32:13.9835839Z moe/activation_test.py:117: 2025-05-07T20:32:13.9836140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9836486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.9836781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9837358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.9837943Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.9838627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.9839342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.9839893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9840607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9841301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9841856Z kernel = self.compile( 2025-05-07T20:32:13.9842421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9843104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9843518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9843755Z 2025-05-07T20:32:13.9843969Z self = 2025-05-07T20:32:13.9845093Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9846533Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a56ac0>} 2025-05-07T20:32:13.9848038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9849103Z context = 2025-05-07T20:32:13.9849408Z 2025-05-07T20:32:13.9849582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9850128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9850617Z module_map=module_map) 2025-05-07T20:32:13.9850989Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9851359Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.9851631Z E ^ 2025-05-07T20:32:13.9852184Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9852660Z 2025-05-07T20:32:13.9853098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9853642Z 2025-05-07T20:32:13.9853749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.9854185Z self=, 2025-05-07T20:32:13.9854604Z T=4096, 2025-05-07T20:32:13.9854805Z D=7168, 2025-05-07T20:32:13.9855009Z scale_ub=None, 2025-05-07T20:32:13.9855229Z contiguous=False, 2025-05-07T20:32:13.9855522Z compiled=True, 2025-05-07T20:32:13.9855738Z ) 2025-05-07T20:32:13.9856071Z self = 2025-05-07T20:32:13.9856593Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.9856883Z 2025-05-07T20:32:13.9856964Z @given( 2025-05-07T20:32:13.9857246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.9857566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.9857885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.9858232Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.9858567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.9858863Z ) 2025-05-07T20:32:13.9859225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.9859686Z def test_silu_mul_quant( 2025-05-07T20:32:13.9859931Z self, 2025-05-07T20:32:13.9860136Z T: int, 2025-05-07T20:32:13.9860343Z D: int, 2025-05-07T20:32:13.9860564Z scale_ub: Optional[float], 2025-05-07T20:32:13.9860845Z contiguous: bool, 2025-05-07T20:32:13.9861095Z compiled: bool, 2025-05-07T20:32:13.9861321Z ) -> None: 2025-05-07T20:32:13.9861546Z torch.manual_seed(2025) 2025-05-07T20:32:13.9861801Z 2025-05-07T20:32:13.9862078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.9862433Z 2025-05-07T20:32:13.9862634Z x_sign = torch.sign(x) 2025-05-07T20:32:13.9862935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.9863259Z x = x_sign * x_clamp 2025-05-07T20:32:13.9863508Z x0 = x[:, :D] 2025-05-07T20:32:13.9863728Z x1 = x[:, D:] 2025-05-07T20:32:13.9863943Z 2025-05-07T20:32:13.9864138Z if contiguous: 2025-05-07T20:32:13.9864373Z x0 = x0.contiguous() 2025-05-07T20:32:13.9864640Z x1 = x1.contiguous() 2025-05-07T20:32:13.9864892Z 2025-05-07T20:32:13.9865086Z if scale_ub is not None: 2025-05-07T20:32:13.9865371Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.9865720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.9866048Z ) 2025-05-07T20:32:13.9866246Z else: 2025-05-07T20:32:13.9866472Z scale_ub_tensor = None 2025-05-07T20:32:13.9866735Z 2025-05-07T20:32:13.9866969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.9867318Z op = silu_mul_quant 2025-05-07T20:32:13.9867664Z if compiled: 2025-05-07T20:32:13.9867929Z op = torch.compile(op) 2025-05-07T20:32:13.9868241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9868520Z 2025-05-07T20:32:13.9868720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.9868888Z 2025-05-07T20:32:13.9869001Z moe/activation_test.py:117: 2025-05-07T20:32:13.9869306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9869658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.9869951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.9870526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.9871112Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.9871797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.9872518Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.9873072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.9873783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.9874482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.9875083Z kernel = self.compile( 2025-05-07T20:32:13.9875637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.9876319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.9886600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.9887009Z 2025-05-07T20:32:13.9887228Z self = 2025-05-07T20:32:13.9888362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.9889789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a55b20>} 2025-05-07T20:32:13.9891185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.9892338Z context = 2025-05-07T20:32:13.9892636Z 2025-05-07T20:32:13.9892805Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.9893343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.9893834Z module_map=module_map) 2025-05-07T20:32:13.9894216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.9894581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.9894853Z E ^ 2025-05-07T20:32:13.9895339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.9895810Z 2025-05-07T20:32:13.9896240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.9896783Z 2025-05-07T20:32:14.1469443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1470097Z self=, 2025-05-07T20:32:14.1470540Z T=16384, 2025-05-07T20:32:14.1470743Z D=5120, 2025-05-07T20:32:14.1470948Z scale_ub=1200.0, 2025-05-07T20:32:14.1471172Z contiguous=False, 2025-05-07T20:32:14.1471407Z compiled=False, 2025-05-07T20:32:14.1471984Z ) 2025-05-07T20:32:14.1472318Z self = 2025-05-07T20:32:14.1472843Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.1473134Z 2025-05-07T20:32:14.1473219Z @given( 2025-05-07T20:32:14.1473451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1473782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1474102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1474441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1474772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1475065Z ) 2025-05-07T20:32:14.1475425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1475882Z def test_silu_mul_quant( 2025-05-07T20:32:14.1476136Z self, 2025-05-07T20:32:14.1476341Z T: int, 2025-05-07T20:32:14.1476551Z D: int, 2025-05-07T20:32:14.1476778Z scale_ub: Optional[float], 2025-05-07T20:32:14.1477059Z contiguous: bool, 2025-05-07T20:32:14.1477304Z compiled: bool, 2025-05-07T20:32:14.1477541Z ) -> None: 2025-05-07T20:32:14.1477763Z torch.manual_seed(2025) 2025-05-07T20:32:14.1478005Z 2025-05-07T20:32:14.1478283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1478719Z 2025-05-07T20:32:14.1478914Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1479214Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1479533Z x = x_sign * x_clamp 2025-05-07T20:32:14.1479784Z x0 = x[:, :D] 2025-05-07T20:32:14.1480003Z x1 = x[:, D:] 2025-05-07T20:32:14.1480299Z 2025-05-07T20:32:14.1480494Z if contiguous: 2025-05-07T20:32:14.1480730Z x0 = x0.contiguous() 2025-05-07T20:32:14.1480998Z x1 = x1.contiguous() 2025-05-07T20:32:14.1481252Z 2025-05-07T20:32:14.1481450Z if scale_ub is not None: 2025-05-07T20:32:14.1481732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1482080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1482392Z ) 2025-05-07T20:32:14.1482592Z else: 2025-05-07T20:32:14.1482812Z scale_ub_tensor = None 2025-05-07T20:32:14.1483067Z 2025-05-07T20:32:14.1483312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1483639Z op = silu_mul_quant 2025-05-07T20:32:14.1483896Z if compiled: 2025-05-07T20:32:14.1484156Z op = torch.compile(op) 2025-05-07T20:32:14.1484467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1484760Z 2025-05-07T20:32:14.1484955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1485129Z 2025-05-07T20:32:14.1485235Z moe/activation_test.py:117: 2025-05-07T20:32:14.1485543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1485887Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1486177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1486891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.1487608Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1488158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1488874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1489564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1490115Z kernel = self.compile( 2025-05-07T20:32:14.1490676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1491448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1491969Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1492209Z 2025-05-07T20:32:14.1492424Z self = 2025-05-07T20:32:14.1493547Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1494993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a54c20>} 2025-05-07T20:32:14.1496391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1497457Z context = 2025-05-07T20:32:14.1497761Z 2025-05-07T20:32:14.1497932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1498474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1498961Z module_map=module_map) 2025-05-07T20:32:14.1499385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1499757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1500027Z E ^ 2025-05-07T20:32:14.1500503Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1501021Z 2025-05-07T20:32:14.1501454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1501994Z 2025-05-07T20:32:14.1502101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1502535Z self=, 2025-05-07T20:32:14.1502949Z T=16384, 2025-05-07T20:32:14.1503152Z D=5120, 2025-05-07T20:32:14.1503360Z scale_ub=1200.0, 2025-05-07T20:32:14.1503587Z contiguous=True, 2025-05-07T20:32:14.1503818Z compiled=True, 2025-05-07T20:32:14.1504033Z ) 2025-05-07T20:32:14.1504359Z self = 2025-05-07T20:32:14.1504879Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.1505171Z 2025-05-07T20:32:14.1505253Z @given( 2025-05-07T20:32:14.1505495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1505818Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1506405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1506750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507392Z ) 2025-05-07T20:32:14.1507758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1508220Z def test_silu_mul_quant( 2025-05-07T20:32:14.1508465Z self, 2025-05-07T20:32:14.1508673Z T: int, 2025-05-07T20:32:14.1508880Z D: int, 2025-05-07T20:32:14.1509101Z scale_ub: Optional[float], 2025-05-07T20:32:14.1509382Z contiguous: bool, 2025-05-07T20:32:14.1509631Z compiled: bool, 2025-05-07T20:32:14.1509856Z ) -> None: 2025-05-07T20:32:14.1510078Z torch.manual_seed(2025) 2025-05-07T20:32:14.1510332Z 2025-05-07T20:32:14.1510606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1510962Z 2025-05-07T20:32:14.1511163Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1511457Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1511779Z x = x_sign * x_clamp 2025-05-07T20:32:14.1512157Z x0 = x[:, :D] 2025-05-07T20:32:14.1512381Z x1 = x[:, D:] 2025-05-07T20:32:14.1512599Z 2025-05-07T20:32:14.1512795Z if contiguous: 2025-05-07T20:32:14.1513033Z x0 = x0.contiguous() 2025-05-07T20:32:14.1513300Z x1 = x1.contiguous() 2025-05-07T20:32:14.1513547Z 2025-05-07T20:32:14.1513740Z if scale_ub is not None: 2025-05-07T20:32:14.1514025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1514370Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1514687Z ) 2025-05-07T20:32:14.1514883Z else: 2025-05-07T20:32:14.1515102Z scale_ub_tensor = None 2025-05-07T20:32:14.1515360Z 2025-05-07T20:32:14.1515599Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1515928Z op = silu_mul_quant 2025-05-07T20:32:14.1516189Z if compiled: 2025-05-07T20:32:14.1516438Z op = torch.compile(op) 2025-05-07T20:32:14.1516753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1517042Z 2025-05-07T20:32:14.1517239Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1517413Z 2025-05-07T20:32:14.1517516Z moe/activation_test.py:117: 2025-05-07T20:32:14.1517823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1518168Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1518540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1519126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.1519707Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.1520385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1521160Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1521724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1522433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1523124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1523677Z kernel = self.compile( 2025-05-07T20:32:14.1524242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1524923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1525343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1525587Z 2025-05-07T20:32:14.1525806Z self = 2025-05-07T20:32:14.1526941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1528362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96385af060>} 2025-05-07T20:32:14.1529749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1530814Z context = 2025-05-07T20:32:14.1531133Z 2025-05-07T20:32:14.1531306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1531928Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1532416Z module_map=module_map) 2025-05-07T20:32:14.1532901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1533271Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1533544Z E ^ 2025-05-07T20:32:14.1534021Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1534495Z 2025-05-07T20:32:14.1534929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1535473Z 2025-05-07T20:32:14.3232577Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.3233711Z self=, 2025-05-07T20:32:14.3234537Z T=16384, 2025-05-07T20:32:14.3234915Z D=5120, 2025-05-07T20:32:14.3235324Z scale_ub=None, 2025-05-07T20:32:14.3235748Z contiguous=False, 2025-05-07T20:32:14.3236189Z compiled=True, 2025-05-07T20:32:14.3236543Z ) 2025-05-07T20:32:14.3236877Z self = 2025-05-07T20:32:14.3237405Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.3237704Z 2025-05-07T20:32:14.3237785Z @given( 2025-05-07T20:32:14.3238021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.3238347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.3238656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.3239249Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.3239588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.3239876Z ) 2025-05-07T20:32:14.3240238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.3240694Z def test_silu_mul_quant( 2025-05-07T20:32:14.3241016Z self, 2025-05-07T20:32:14.3241219Z T: int, 2025-05-07T20:32:14.3241421Z D: int, 2025-05-07T20:32:14.3241637Z scale_ub: Optional[float], 2025-05-07T20:32:14.3241922Z contiguous: bool, 2025-05-07T20:32:14.3242179Z compiled: bool, 2025-05-07T20:32:14.3242411Z ) -> None: 2025-05-07T20:32:14.3242625Z torch.manual_seed(2025) 2025-05-07T20:32:14.3242871Z 2025-05-07T20:32:14.3243149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.3243496Z 2025-05-07T20:32:14.3243694Z x_sign = torch.sign(x) 2025-05-07T20:32:14.3243995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.3244313Z x = x_sign * x_clamp 2025-05-07T20:32:14.3244564Z x0 = x[:, :D] 2025-05-07T20:32:14.3244788Z x1 = x[:, D:] 2025-05-07T20:32:14.3244997Z 2025-05-07T20:32:14.3245189Z if contiguous: 2025-05-07T20:32:14.3245427Z x0 = x0.contiguous() 2025-05-07T20:32:14.3245691Z x1 = x1.contiguous() 2025-05-07T20:32:14.3245941Z 2025-05-07T20:32:14.3246138Z if scale_ub is not None: 2025-05-07T20:32:14.3246411Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.3246767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.3247086Z ) 2025-05-07T20:32:14.3247279Z else: 2025-05-07T20:32:14.3247500Z scale_ub_tensor = None 2025-05-07T20:32:14.3247761Z 2025-05-07T20:32:14.3248002Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.3248324Z op = silu_mul_quant 2025-05-07T20:32:14.3248587Z if compiled: 2025-05-07T20:32:14.3248844Z op = torch.compile(op) 2025-05-07T20:32:14.3249144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.3249427Z 2025-05-07T20:32:14.3249627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.3249796Z 2025-05-07T20:32:14.3249897Z moe/activation_test.py:117: 2025-05-07T20:32:14.3250208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.3250555Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.3250989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.3251575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.3252254Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.3252935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.3253648Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.3254208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.3254917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.3255606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.3256156Z kernel = self.compile( 2025-05-07T20:32:14.3256725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.3257412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.3257819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.3258068Z 2025-05-07T20:32:14.3258280Z self = 2025-05-07T20:32:14.3259405Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.3260900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9638200b80>} 2025-05-07T20:32:14.3262344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.3263403Z context = 2025-05-07T20:32:14.3263711Z 2025-05-07T20:32:14.3263883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.3264428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.3264918Z module_map=module_map) 2025-05-07T20:32:14.3265293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.3265661Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.3265931Z E ^ 2025-05-07T20:32:14.3266409Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.3266883Z 2025-05-07T20:32:14.3267320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.3267869Z 2025-05-07T20:32:14.3267976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.3268410Z self=, 2025-05-07T20:32:14.3268829Z T=2048, 2025-05-07T20:32:14.3269029Z D=5120, 2025-05-07T20:32:14.3269230Z scale_ub=None, 2025-05-07T20:32:14.3269450Z contiguous=False, 2025-05-07T20:32:14.3269685Z compiled=True, 2025-05-07T20:32:14.3269901Z ) 2025-05-07T20:32:14.4177077Z self = 2025-05-07T20:32:14.4177755Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.4178040Z 2025-05-07T20:32:14.4178119Z @given( 2025-05-07T20:32:14.4178350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.4178694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.4179000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.4179677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.4180015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.4180298Z ) 2025-05-07T20:32:14.4180648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.4181099Z def test_silu_mul_quant( 2025-05-07T20:32:14.4181350Z self, 2025-05-07T20:32:14.4181542Z T: int, 2025-05-07T20:32:14.4181750Z D: int, 2025-05-07T20:32:14.4181970Z scale_ub: Optional[float], 2025-05-07T20:32:14.4182239Z contiguous: bool, 2025-05-07T20:32:14.4182485Z compiled: bool, 2025-05-07T20:32:14.4182719Z ) -> None: 2025-05-07T20:32:14.4182934Z torch.manual_seed(2025) 2025-05-07T20:32:14.4183177Z 2025-05-07T20:32:14.4183457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.4183801Z 2025-05-07T20:32:14.4183995Z x_sign = torch.sign(x) 2025-05-07T20:32:14.4184292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.4184614Z x = x_sign * x_clamp 2025-05-07T20:32:14.4184853Z x0 = x[:, :D] 2025-05-07T20:32:14.4185073Z x1 = x[:, D:] 2025-05-07T20:32:14.4185283Z 2025-05-07T20:32:14.4185467Z if contiguous: 2025-05-07T20:32:14.4185700Z x0 = x0.contiguous() 2025-05-07T20:32:14.4185965Z x1 = x1.contiguous() 2025-05-07T20:32:14.4186200Z 2025-05-07T20:32:14.4186475Z if scale_ub is not None: 2025-05-07T20:32:14.4186752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.4187089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.4187404Z ) 2025-05-07T20:32:14.4187601Z else: 2025-05-07T20:32:14.4187810Z scale_ub_tensor = None 2025-05-07T20:32:14.4188148Z 2025-05-07T20:32:14.4188385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.4188700Z op = silu_mul_quant 2025-05-07T20:32:14.4188960Z if compiled: 2025-05-07T20:32:14.4189218Z op = torch.compile(op) 2025-05-07T20:32:14.4189525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4189800Z 2025-05-07T20:32:14.4189998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.4190164Z 2025-05-07T20:32:14.4190270Z moe/activation_test.py:117: 2025-05-07T20:32:14.4190572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4190920Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.4191213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4191782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.4192362Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.4193054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.4193762Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.4194317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.4195028Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.4195721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.4196266Z kernel = self.compile( 2025-05-07T20:32:14.4196832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.4197511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.4197920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4198156Z 2025-05-07T20:32:14.4198367Z self = 2025-05-07T20:32:14.4199572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.4201007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382020c0>} 2025-05-07T20:32:14.4202402Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.4203467Z context = 2025-05-07T20:32:14.4203761Z 2025-05-07T20:32:14.4203932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.4204471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.4204953Z module_map=module_map) 2025-05-07T20:32:14.4205323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.4205686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.4205952Z E ^ 2025-05-07T20:32:14.4206795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.4207263Z 2025-05-07T20:32:14.4207694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.4208300Z 2025-05-07T20:32:14.4208406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.4208832Z self=, 2025-05-07T20:32:14.4209241Z T=2048, 2025-05-07T20:32:14.4209501Z D=5120, 2025-05-07T20:32:14.4209701Z scale_ub=1200.0, 2025-05-07T20:32:14.4209929Z contiguous=False, 2025-05-07T20:32:14.4210151Z compiled=True, 2025-05-07T20:32:14.4210359Z ) 2025-05-07T20:32:14.4210692Z self = 2025-05-07T20:32:14.4211201Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.4211490Z 2025-05-07T20:32:14.4211568Z @given( 2025-05-07T20:32:14.4211880Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.4212194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.4212510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.4212868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.4213202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.4213492Z ) 2025-05-07T20:32:14.4213854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.4214313Z def test_silu_mul_quant( 2025-05-07T20:32:14.4214554Z self, 2025-05-07T20:32:14.4214754Z T: int, 2025-05-07T20:32:14.4214958Z D: int, 2025-05-07T20:32:14.4215181Z scale_ub: Optional[float], 2025-05-07T20:32:14.4215461Z contiguous: bool, 2025-05-07T20:32:14.4215710Z compiled: bool, 2025-05-07T20:32:14.4215936Z ) -> None: 2025-05-07T20:32:14.4216159Z torch.manual_seed(2025) 2025-05-07T20:32:14.4216414Z 2025-05-07T20:32:14.4216692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.4217044Z 2025-05-07T20:32:14.4217248Z x_sign = torch.sign(x) 2025-05-07T20:32:14.4217540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.4217861Z x = x_sign * x_clamp 2025-05-07T20:32:14.4228675Z x0 = x[:, :D] 2025-05-07T20:32:14.4228976Z x1 = x[:, D:] 2025-05-07T20:32:14.4229190Z 2025-05-07T20:32:14.4229385Z if contiguous: 2025-05-07T20:32:14.4229626Z x0 = x0.contiguous() 2025-05-07T20:32:14.4229896Z x1 = x1.contiguous() 2025-05-07T20:32:14.4230134Z 2025-05-07T20:32:14.4230328Z if scale_ub is not None: 2025-05-07T20:32:14.4230788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.4231152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.4231465Z ) 2025-05-07T20:32:14.4231660Z else: 2025-05-07T20:32:14.4231863Z scale_ub_tensor = None 2025-05-07T20:32:14.4232112Z 2025-05-07T20:32:14.4232348Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.4232675Z op = silu_mul_quant 2025-05-07T20:32:14.4232927Z if compiled: 2025-05-07T20:32:14.4233188Z op = torch.compile(op) 2025-05-07T20:32:14.4233491Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4233769Z 2025-05-07T20:32:14.4233964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.4234134Z 2025-05-07T20:32:14.4234240Z moe/activation_test.py:117: 2025-05-07T20:32:14.4234544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4234883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.4235170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.4235748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.4236320Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.4237048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.4237806Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.4238355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.4239052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.4239777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.4240325Z kernel = self.compile( 2025-05-07T20:32:14.4240882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.4241559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.4241967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.4242201Z 2025-05-07T20:32:14.4242417Z self = 2025-05-07T20:32:14.4243533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.4244956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382032e0>} 2025-05-07T20:32:14.4246355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.4247416Z context = 2025-05-07T20:32:14.4247711Z 2025-05-07T20:32:14.4247886Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.4248419Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.4248905Z module_map=module_map) 2025-05-07T20:32:14.4249278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.4249637Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.4249904Z E ^ 2025-05-07T20:32:14.4250383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.4250849Z 2025-05-07T20:32:14.4251374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.4251979Z 2025-05-07T20:32:14.5989914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.5990895Z self=, 2025-05-07T20:32:14.5991715Z T=4096, 2025-05-07T20:32:14.5992079Z D=5120, 2025-05-07T20:32:14.5992459Z scale_ub=1200.0, 2025-05-07T20:32:14.5992922Z contiguous=True, 2025-05-07T20:32:14.5993361Z compiled=True, 2025-05-07T20:32:14.5993760Z ) 2025-05-07T20:32:14.5994409Z self = 2025-05-07T20:32:14.5995424Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.5995981Z 2025-05-07T20:32:14.5996145Z @given( 2025-05-07T20:32:14.5996461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5996780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5997088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5997452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5997789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5998084Z ) 2025-05-07T20:32:14.5998436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5998890Z def test_silu_mul_quant( 2025-05-07T20:32:14.5999134Z self, 2025-05-07T20:32:14.5999600Z T: int, 2025-05-07T20:32:14.5999792Z D: int, 2025-05-07T20:32:14.6000010Z scale_ub: Optional[float], 2025-05-07T20:32:14.6000286Z contiguous: bool, 2025-05-07T20:32:14.6000523Z compiled: bool, 2025-05-07T20:32:14.6000757Z ) -> None: 2025-05-07T20:32:14.6000977Z torch.manual_seed(2025) 2025-05-07T20:32:14.6001306Z 2025-05-07T20:32:14.6001583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6001930Z 2025-05-07T20:32:14.6002120Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6002421Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6002739Z x = x_sign * x_clamp 2025-05-07T20:32:14.6002979Z x0 = x[:, :D] 2025-05-07T20:32:14.6003199Z x1 = x[:, D:] 2025-05-07T20:32:14.6003414Z 2025-05-07T20:32:14.6003598Z if contiguous: 2025-05-07T20:32:14.6003833Z x0 = x0.contiguous() 2025-05-07T20:32:14.6004098Z x1 = x1.contiguous() 2025-05-07T20:32:14.6004338Z 2025-05-07T20:32:14.6004534Z if scale_ub is not None: 2025-05-07T20:32:14.6004813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6005158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6005466Z ) 2025-05-07T20:32:14.6005667Z else: 2025-05-07T20:32:14.6005879Z scale_ub_tensor = None 2025-05-07T20:32:14.6006129Z 2025-05-07T20:32:14.6006645Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6006969Z op = silu_mul_quant 2025-05-07T20:32:14.6007227Z if compiled: 2025-05-07T20:32:14.6007481Z op = torch.compile(op) 2025-05-07T20:32:14.6007783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6008058Z 2025-05-07T20:32:14.6008255Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6008420Z 2025-05-07T20:32:14.6008527Z moe/activation_test.py:117: 2025-05-07T20:32:14.6008832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6009173Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6009463Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6010041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6010616Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6011295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6012257Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6012807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6013512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6014201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6014753Z kernel = self.compile( 2025-05-07T20:32:14.6015310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6015994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6016405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6016642Z 2025-05-07T20:32:14.6016860Z self = 2025-05-07T20:32:14.6017980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6019417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7e7c860>} 2025-05-07T20:32:14.6020877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6021940Z context = 2025-05-07T20:32:14.6022306Z 2025-05-07T20:32:14.6022482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6023015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6023498Z module_map=module_map) 2025-05-07T20:32:14.6023872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6024227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6024494Z E ^ 2025-05-07T20:32:14.6024971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6025438Z 2025-05-07T20:32:14.6025872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6026401Z 2025-05-07T20:32:14.6026508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6026933Z self=, 2025-05-07T20:32:14.6027349Z T=128, 2025-05-07T20:32:14.6027537Z D=5120, 2025-05-07T20:32:14.6027732Z scale_ub=1200.0, 2025-05-07T20:32:14.6027961Z contiguous=False, 2025-05-07T20:32:14.6028184Z compiled=True, 2025-05-07T20:32:14.6028398Z ) 2025-05-07T20:32:14.8815578Z self = 2025-05-07T20:32:14.8816197Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.8816478Z 2025-05-07T20:32:14.8816561Z @given( 2025-05-07T20:32:14.8816826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.8817171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.8817481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.8817816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.8818143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.8818433Z ) 2025-05-07T20:32:14.8818797Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.8819246Z def test_silu_mul_quant( 2025-05-07T20:32:14.8819497Z self, 2025-05-07T20:32:14.8819699Z T: int, 2025-05-07T20:32:14.8820279Z D: int, 2025-05-07T20:32:14.8820511Z scale_ub: Optional[float], 2025-05-07T20:32:14.8820789Z contiguous: bool, 2025-05-07T20:32:14.8821035Z compiled: bool, 2025-05-07T20:32:14.8821258Z ) -> None: 2025-05-07T20:32:14.8821477Z torch.manual_seed(2025) 2025-05-07T20:32:14.8821724Z 2025-05-07T20:32:14.8822001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.8822356Z 2025-05-07T20:32:14.8822559Z x_sign = torch.sign(x) 2025-05-07T20:32:14.8822851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.8823170Z x = x_sign * x_clamp 2025-05-07T20:32:14.8823418Z x0 = x[:, :D] 2025-05-07T20:32:14.8823633Z x1 = x[:, D:] 2025-05-07T20:32:14.8823856Z 2025-05-07T20:32:14.8824049Z if contiguous: 2025-05-07T20:32:14.8824280Z x0 = x0.contiguous() 2025-05-07T20:32:14.8824544Z x1 = x1.contiguous() 2025-05-07T20:32:14.8824792Z 2025-05-07T20:32:14.8824986Z if scale_ub is not None: 2025-05-07T20:32:14.8825265Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.8825608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.8825923Z ) 2025-05-07T20:32:14.8826117Z else: 2025-05-07T20:32:14.8826334Z scale_ub_tensor = None 2025-05-07T20:32:14.8826591Z 2025-05-07T20:32:14.8826937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.8827258Z op = silu_mul_quant 2025-05-07T20:32:14.8827517Z if compiled: 2025-05-07T20:32:14.8827763Z op = torch.compile(op) 2025-05-07T20:32:14.8828069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8828435Z 2025-05-07T20:32:14.8828631Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.8828807Z 2025-05-07T20:32:14.8828909Z moe/activation_test.py:117: 2025-05-07T20:32:14.8829222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.8829557Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.8829846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.8830421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.8830996Z return fn(*args, **kwargs) 
[The CompilationError traceback above repeats verbatim for this example and for each of the following Hypothesis examples — every one fails in make_ir with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100:]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
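The failure mode above is architectural, not numerical: the job ran on a g5.4xlarge runner whose NVIDIA A10G GPU is compute capability 8.6 (sm_86), and on that part Triton exposes only the fp8e4b15 and fp8e5 variants named in the error, not fp8e4nv (FP8 E4M3). Below is a minimal sketch of a capability gate a test suite could use to skip these examples on unsupported GPUs; the helper names and the (8, 9) threshold are assumptions inferred from the error above, not taken from this log or from FBGEMM's actual test configuration.

    import pytest
    import torch

    def fp8_e4m3_supported() -> bool:
        # Assumption: Triton's fp8e4nv needs compute capability >= 8.9
        # (Ada/Hopper); the sm_86 A10G here rejects it at compile time.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; FBGEMM may gate these tests differently.
    requires_fp8 = pytest.mark.skipif(
        not fp8_e4m3_supported(),
        reason="Triton fp8e4nv (FP8 E4M3) requires SM 8.9+",
    )

Applied as @requires_fp8 on test_silu_mul_quant, the examples above would be skipped instead of erroring inside Triton compilation.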
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.61 GiB is allocated by PyTorch and 141.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free; 21.50 GiB is allocated by PyTorch and 141.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB is allocated by PyTorch and 85.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB is allocated by PyTorch and 85.02 MiB reserved but unallocated. (Same allocator advice as above.)

moe/activation_test.py:94: OutOfMemoryError
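Beyond the FP8 compile failures, the process is also exhausting GPU memory: each OOM above shows roughly 22 GiB already held on a 22.07 GiB device, so even small allocations (56-448 MiB) fail, and the allocator's own message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. A sketch of both mitigations follows; flushing the cache between Hypothesis examples assumes cross-example accumulation is the cause, which the log does not confirm.

    import os
    # Must be set before the first CUDA allocation to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc
    import torch

    def release_cached_memory() -> None:
        # Drop dead Python references, then return the caching allocator's
        # unused blocks to the driver (e.g. between Hypothesis examples).
        gc.collect()
        torch.cuda.empty_cache()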
[The same CompilationError traceback then repeats for the following examples; for the last one, the excerpt ends partway through the traceback:]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.5528799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.5541337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.5542137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.5542692Z kernel = self.compile( 2025-05-07T20:32:15.5543250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.5543930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.5544333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5544574Z 2025-05-07T20:32:15.5544788Z self = 2025-05-07T20:32:15.5545908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.5547338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7adcae0>} 2025-05-07T20:32:15.5548729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.5549913Z context = 2025-05-07T20:32:15.5550215Z 2025-05-07T20:32:15.5550388Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.5550938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.5551419Z module_map=module_map) 2025-05-07T20:32:15.5551801Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.5552163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.5552425Z E ^ 2025-05-07T20:32:15.5553036Z E ValueError("type fp8e4nv not supported in this architecture. 
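For context on what keeps failing to compile: silu_mul_quant fuses a SiLU-gated multiply with quantization to fp8, returning the quantized tensor plus a scale. An eager-mode sketch of the unquantized part, with semantics inferred from the test body rather than taken from the Triton kernel:

```python
import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU-gated multiply; the fused kernel additionally quantizes this
    # result to fp8 and returns a scale tensor alongside it.
    return F.silu(x0) * x1
```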
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.5553504Z 2025-05-07T20:32:15.5553936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.5554465Z 2025-05-07T20:32:15.5554577Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5555013Z self=, 2025-05-07T20:32:15.5555428Z T=2048, 2025-05-07T20:32:15.5555622Z D=7168, 2025-05-07T20:32:15.5555820Z scale_ub=1200.0, 2025-05-07T20:32:15.5556045Z contiguous=True, 2025-05-07T20:32:15.5556273Z compiled=False, 2025-05-07T20:32:15.5556487Z ) 2025-05-07T20:32:15.6356228Z self = 2025-05-07T20:32:15.6357154Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.6357637Z 2025-05-07T20:32:15.6357758Z @given( 2025-05-07T20:32:15.6358128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6358630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6359122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6359661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6360198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6360677Z ) 2025-05-07T20:32:15.6361580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6362328Z def test_silu_mul_quant( 2025-05-07T20:32:15.6362720Z self, 2025-05-07T20:32:15.6363029Z T: int, 2025-05-07T20:32:15.6363336Z D: int, 2025-05-07T20:32:15.6363687Z scale_ub: Optional[float], 2025-05-07T20:32:15.6364245Z contiguous: bool, 2025-05-07T20:32:15.6364580Z compiled: bool, 2025-05-07T20:32:15.6364903Z ) -> None: 2025-05-07T20:32:15.6365222Z torch.manual_seed(2025) 2025-05-07T20:32:15.6365602Z 2025-05-07T20:32:15.6366037Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6369679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.6373439Z 2025-05-07T20:32:15.6373650Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.6374038Z 2025-05-07T20:32:15.6374206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6374937Z self=, 2025-05-07T20:32:15.6375630Z T=1, 2025-05-07T20:32:15.6375925Z D=5120, 2025-05-07T20:32:15.6376224Z scale_ub=1200.0, 2025-05-07T20:32:15.6376553Z contiguous=True, 2025-05-07T20:32:15.6376925Z compiled=False, 2025-05-07T20:32:15.6377263Z ) 2025-05-07T20:32:15.6377789Z self = 2025-05-07T20:32:15.6378591Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.6379034Z 2025-05-07T20:32:15.6379158Z @given( 2025-05-07T20:32:15.6379522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6380010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6380512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6381058Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6381593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6382057Z ) 2025-05-07T20:32:15.6382885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6383642Z def test_silu_mul_quant( 2025-05-07T20:32:15.6384014Z self, 2025-05-07T20:32:15.6384323Z T: int, 2025-05-07T20:32:15.6384639Z D: int, 2025-05-07T20:32:15.6384987Z scale_ub: Optional[float], 2025-05-07T20:32:15.6385450Z contiguous: bool, 2025-05-07T20:32:15.6385832Z compiled: bool, 2025-05-07T20:32:15.6386183Z ) -> None: 2025-05-07T20:32:15.6386507Z torch.manual_seed(2025) 2025-05-07T20:32:15.6386932Z 2025-05-07T20:32:15.6387404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6387997Z 2025-05-07T20:32:15.6388308Z x_sign = torch.sign(x) 2025-05-07T20:32:15.6388788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.6389333Z x = x_sign * x_clamp 2025-05-07T20:32:15.6389730Z x0 = x[:, :D] 2025-05-07T20:32:15.6390073Z x1 = x[:, D:] 2025-05-07T20:32:15.6390416Z 2025-05-07T20:32:15.6390723Z if contiguous: 2025-05-07T20:32:15.6391099Z x0 = x0.contiguous() 2025-05-07T20:32:15.6391532Z x1 = x1.contiguous() 2025-05-07T20:32:15.6391935Z 2025-05-07T20:32:15.6392245Z if scale_ub is not None: 2025-05-07T20:32:15.6392696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.6393273Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.6393893Z ) 2025-05-07T20:32:15.6394198Z else: 2025-05-07T20:32:15.6394538Z scale_ub_tensor = None 2025-05-07T20:32:15.6394967Z 2025-05-07T20:32:15.6395340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.6395884Z op = silu_mul_quant 2025-05-07T20:32:15.6396371Z if compiled: 2025-05-07T20:32:15.6396778Z op = torch.compile(op) 2025-05-07T20:32:15.6397280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.6397752Z 2025-05-07T20:32:15.6398055Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.6398332Z 2025-05-07T20:32:15.6398486Z moe/activation_test.py:117: 2025-05-07T20:32:15.6398963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6399506Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.6399958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.6401200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.6402499Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.6403467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.6404733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.6405964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.6407144Z kernel = self.compile( 2025-05-07T20:32:15.6408128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.6409352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.6410060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.6410476Z 2025-05-07T20:32:15.6410847Z self = 2025-05-07T20:32:15.6412991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.6415646Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7ade0c0>} 2025-05-07T20:32:15.6418474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.6420274Z context = 2025-05-07T20:32:15.6420695Z 2025-05-07T20:32:15.6420943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.6421748Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.6422480Z module_map=module_map) 2025-05-07T20:32:15.6423006Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.6423501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.6423869Z E ^ 2025-05-07T20:32:15.6424549Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.6425219Z 2025-05-07T20:32:15.6425851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.6426627Z 2025-05-07T20:32:15.6426769Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6427421Z self=, 2025-05-07T20:32:15.6428011Z T=2048, 2025-05-07T20:32:15.6428261Z D=5120, 2025-05-07T20:32:15.6428656Z scale_ub=None, 2025-05-07T20:32:15.6428957Z contiguous=True, 2025-05-07T20:32:15.6429266Z compiled=False, 2025-05-07T20:32:15.6429559Z ) 2025-05-07T20:32:15.6430016Z self = 2025-05-07T20:32:15.6430723Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.6431226Z 2025-05-07T20:32:15.6431339Z @given( 2025-05-07T20:32:15.6431653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.6432098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.6432528Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.6432996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.6433460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.6433870Z ) 2025-05-07T20:32:15.6434372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.6435017Z def test_silu_mul_quant( 2025-05-07T20:32:15.6435346Z self, 2025-05-07T20:32:15.6435629Z T: int, 2025-05-07T20:32:15.6435903Z D: int, 2025-05-07T20:32:15.6436195Z scale_ub: Optional[float], 2025-05-07T20:32:15.6436573Z contiguous: bool, 2025-05-07T20:32:15.6436903Z compiled: bool, 2025-05-07T20:32:15.6437207Z ) -> None: 2025-05-07T20:32:15.6437507Z torch.manual_seed(2025) 2025-05-07T20:32:15.6437851Z 2025-05-07T20:32:15.6438219Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.6438702Z 2025-05-07T20:32:15.6438973Z > x_sign = torch.sign(x) 2025-05-07T20:32:15.6441912Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.6444739Z 2025-05-07T20:32:15.6444926Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:15.6445269Z 2025-05-07T20:32:15.6445420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.6446076Z self=, 2025-05-07T20:32:15.6446740Z T=16384, 2025-05-07T20:32:15.6447217Z D=5120, 2025-05-07T20:32:15.6447535Z scale_ub=None, 2025-05-07T20:32:15.6447883Z contiguous=True, 2025-05-07T20:32:15.6448229Z compiled=False, 2025-05-07T20:32:15.6448544Z ) 2025-05-07T20:32:15.7182797Z self = 2025-05-07T20:32:15.7183703Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.7184211Z 2025-05-07T20:32:15.7184327Z @given( 2025-05-07T20:32:15.7184683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7185186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7185678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7186217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7186754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7187283Z ) 2025-05-07T20:32:15.7187877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7188640Z def test_silu_mul_quant( 2025-05-07T20:32:15.7189025Z self, 2025-05-07T20:32:15.7189332Z T: int, 2025-05-07T20:32:15.7189669Z D: int, 2025-05-07T20:32:15.7190010Z scale_ub: Optional[float], 2025-05-07T20:32:15.7190450Z contiguous: bool, 2025-05-07T20:32:15.7190827Z compiled: bool, 2025-05-07T20:32:15.7191182Z ) -> None: 2025-05-07T20:32:15.7191731Z torch.manual_seed(2025) 2025-05-07T20:32:15.7192117Z 2025-05-07T20:32:15.7192555Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7196237Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7200050Z 2025-05-07T20:32:15.7200255Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7200639Z 2025-05-07T20:32:15.7200807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7201528Z self=, 2025-05-07T20:32:15.7202205Z T=4096, 2025-05-07T20:32:15.7202500Z D=5120, 2025-05-07T20:32:15.7202801Z scale_ub=None, 2025-05-07T20:32:15.7203140Z contiguous=True, 2025-05-07T20:32:15.7203499Z compiled=False, 2025-05-07T20:32:15.7203836Z ) 2025-05-07T20:32:15.7204361Z self = 2025-05-07T20:32:15.7205182Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.7205647Z 2025-05-07T20:32:15.7205783Z @given( 2025-05-07T20:32:15.7206138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7206916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7207445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7207979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7208523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7209002Z ) 2025-05-07T20:32:15.7209576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7210329Z def test_silu_mul_quant( 2025-05-07T20:32:15.7210706Z self, 2025-05-07T20:32:15.7211012Z T: int, 2025-05-07T20:32:15.7211325Z D: int, 2025-05-07T20:32:15.7211686Z scale_ub: Optional[float], 2025-05-07T20:32:15.7212261Z contiguous: bool, 2025-05-07T20:32:15.7212660Z compiled: bool, 2025-05-07T20:32:15.7213005Z ) -> None: 2025-05-07T20:32:15.7213344Z torch.manual_seed(2025) 2025-05-07T20:32:15.7214001Z 2025-05-07T20:32:15.7214467Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7218401Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
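The OOM messages themselves diagnose fragmentation: tens of MiB sit reserved by PyTorch but unallocated while the free pool is down to about 26 MiB. The allocator hint quoted in each message has to be applied before the process touches CUDA, for example in the environment that launches pytest (a sketch, assuming no CUDA context exists yet):

```python
import os

# Must be set before the first CUDA allocation in the process;
# setting it after torch has initialized CUDA has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```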
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7222060Z 2025-05-07T20:32:15.7222258Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7222640Z 2025-05-07T20:32:15.7222806Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7223541Z self=, 2025-05-07T20:32:15.7224253Z T=2048, 2025-05-07T20:32:15.7224559Z D=5120, 2025-05-07T20:32:15.7224863Z scale_ub=None, 2025-05-07T20:32:15.7225192Z contiguous=False, 2025-05-07T20:32:15.7225548Z compiled=False, 2025-05-07T20:32:15.7225867Z ) 2025-05-07T20:32:15.7226381Z self = 2025-05-07T20:32:15.7227308Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.7227776Z 2025-05-07T20:32:15.7227907Z @given( 2025-05-07T20:32:15.7228283Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7228817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7229340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7230014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7230574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7231067Z ) 2025-05-07T20:32:15.7231686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7232462Z def test_silu_mul_quant( 2025-05-07T20:32:15.7232870Z self, 2025-05-07T20:32:15.7233182Z T: int, 2025-05-07T20:32:15.7233493Z D: int, 2025-05-07T20:32:15.7233846Z scale_ub: Optional[float], 2025-05-07T20:32:15.7234304Z contiguous: bool, 2025-05-07T20:32:15.7234702Z compiled: bool, 2025-05-07T20:32:15.7235077Z ) -> None: 2025-05-07T20:32:15.7235426Z torch.manual_seed(2025) 2025-05-07T20:32:15.7235839Z 2025-05-07T20:32:15.7236285Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7240345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7244023Z 2025-05-07T20:32:15.7244221Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7244597Z 2025-05-07T20:32:15.7244775Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7245492Z self=, 2025-05-07T20:32:15.7246210Z T=4096, 2025-05-07T20:32:15.7246490Z D=7168, 2025-05-07T20:32:15.7246767Z scale_ub=None, 2025-05-07T20:32:15.7247091Z contiguous=True, 2025-05-07T20:32:15.7247450Z compiled=True, 2025-05-07T20:32:15.7247750Z ) 2025-05-07T20:32:15.7248212Z self = 2025-05-07T20:32:15.7249113Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.7249522Z 2025-05-07T20:32:15.7249647Z @given( 2025-05-07T20:32:15.7249957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7250402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7250830Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7251287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7251873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7252288Z ) 2025-05-07T20:32:15.7252785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7253415Z def test_silu_mul_quant( 2025-05-07T20:32:15.7253759Z self, 2025-05-07T20:32:15.7254030Z T: int, 2025-05-07T20:32:15.7254299Z D: int, 2025-05-07T20:32:15.7254599Z scale_ub: Optional[float], 2025-05-07T20:32:15.7254982Z contiguous: bool, 2025-05-07T20:32:15.7255306Z compiled: bool, 2025-05-07T20:32:15.7255637Z ) -> None: 2025-05-07T20:32:15.7255938Z torch.manual_seed(2025) 2025-05-07T20:32:15.7256269Z 2025-05-07T20:32:15.7256645Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7259776Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7262738Z 2025-05-07T20:32:15.7262908Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7263211Z 2025-05-07T20:32:15.7263359Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7263948Z self=, 2025-05-07T20:32:15.7264523Z T=2048, 2025-05-07T20:32:15.7264789Z D=5120, 2025-05-07T20:32:15.7265048Z scale_ub=1200.0, 2025-05-07T20:32:15.7265356Z contiguous=False, 2025-05-07T20:32:15.7265665Z compiled=False, 2025-05-07T20:32:15.7265938Z ) 2025-05-07T20:32:15.7266383Z self = 2025-05-07T20:32:15.7267108Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:15.7267505Z 2025-05-07T20:32:15.7267617Z @given( 2025-05-07T20:32:15.7267922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7268358Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7268799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7269280Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7269762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7270183Z ) 2025-05-07T20:32:15.7270721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7271427Z def test_silu_mul_quant( 2025-05-07T20:32:15.7271816Z self, 2025-05-07T20:32:15.7272105Z T: int, 2025-05-07T20:32:15.7272421Z D: int, 2025-05-07T20:32:15.7272767Z scale_ub: Optional[float], 2025-05-07T20:32:15.7273219Z contiguous: bool, 2025-05-07T20:32:15.7273596Z compiled: bool, 2025-05-07T20:32:15.7273942Z ) -> None: 2025-05-07T20:32:15.7274277Z torch.manual_seed(2025) 2025-05-07T20:32:15.7274657Z 2025-05-07T20:32:15.7275081Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7279605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.7283012Z 2025-05-07T20:32:15.7283212Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.7283571Z 2025-05-07T20:32:15.7283729Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7284411Z self=, 2025-05-07T20:32:15.7285094Z T=4096, 2025-05-07T20:32:15.7285387Z D=7168, 2025-05-07T20:32:15.7285670Z scale_ub=1200.0, 2025-05-07T20:32:15.7286010Z contiguous=True, 2025-05-07T20:32:15.7286358Z compiled=False, 2025-05-07T20:32:15.7286679Z ) 2025-05-07T20:32:15.8330049Z self = 2025-05-07T20:32:15.8331003Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.8331473Z 2025-05-07T20:32:15.8331589Z @given( 2025-05-07T20:32:15.8332035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8332534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8333028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8333867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8334388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8334839Z ) 2025-05-07T20:32:15.8335404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8336162Z def test_silu_mul_quant( 2025-05-07T20:32:15.8336666Z self, 2025-05-07T20:32:15.8336964Z T: int, 2025-05-07T20:32:15.8337267Z D: int, 2025-05-07T20:32:15.8337593Z scale_ub: Optional[float], 2025-05-07T20:32:15.8338017Z contiguous: bool, 2025-05-07T20:32:15.8338395Z compiled: bool, 2025-05-07T20:32:15.8338744Z ) -> None: 2025-05-07T20:32:15.8339076Z torch.manual_seed(2025) 2025-05-07T20:32:15.8339465Z 2025-05-07T20:32:15.8339889Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8343589Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8347036Z 2025-05-07T20:32:15.8347212Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8347570Z 2025-05-07T20:32:15.8347730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8348403Z self=, 2025-05-07T20:32:15.8349065Z T=16384, 2025-05-07T20:32:15.8349364Z D=7168, 2025-05-07T20:32:15.8349649Z scale_ub=None, 2025-05-07T20:32:15.8349933Z contiguous=False, 2025-05-07T20:32:15.8350239Z compiled=True, 2025-05-07T20:32:15.8350529Z ) 2025-05-07T20:32:15.8350992Z self = 2025-05-07T20:32:15.8351766Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.8352228Z 2025-05-07T20:32:15.8352360Z @given( 2025-05-07T20:32:15.8352692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8353182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8353680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8354456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8355023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8355508Z ) 2025-05-07T20:32:15.8356111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8356862Z def test_silu_mul_quant( 2025-05-07T20:32:15.8357274Z self, 2025-05-07T20:32:15.8357593Z T: int, 2025-05-07T20:32:15.8357912Z D: int, 2025-05-07T20:32:15.8370009Z scale_ub: Optional[float], 2025-05-07T20:32:15.8370512Z contiguous: bool, 2025-05-07T20:32:15.8370903Z compiled: bool, 2025-05-07T20:32:15.8371269Z ) -> None: 2025-05-07T20:32:15.8371621Z torch.manual_seed(2025) 2025-05-07T20:32:15.8372123Z 2025-05-07T20:32:15.8372585Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8376336Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
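The failed allocation sizes track the input tensor exactly: x has shape [T, 2 * D] in bfloat16, two bytes per element. For the T=16384, D=7168 draw above, that works out to precisely the 448.00 MiB the allocator reports:

```python
T, D = 16384, 7168
bytes_needed = T * (2 * D) * 2   # bfloat16 is 2 bytes per element
print(bytes_needed / 2**20)      # 448.0 (MiB)
```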
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8379707Z 2025-05-07T20:32:15.8379912Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8380278Z 2025-05-07T20:32:15.8380453Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8381150Z self=, 2025-05-07T20:32:15.8381845Z T=4096, 2025-05-07T20:32:15.8382239Z D=7168, 2025-05-07T20:32:15.8382541Z scale_ub=None, 2025-05-07T20:32:15.8382894Z contiguous=True, 2025-05-07T20:32:15.8383264Z compiled=False, 2025-05-07T20:32:15.8383586Z ) 2025-05-07T20:32:15.8384127Z self = 2025-05-07T20:32:15.8384980Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.8385453Z 2025-05-07T20:32:15.8385581Z @given( 2025-05-07T20:32:15.8385943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8386464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8386977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8387525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8388065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8388530Z ) 2025-05-07T20:32:15.8389112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8389881Z def test_silu_mul_quant( 2025-05-07T20:32:15.8390276Z self, 2025-05-07T20:32:15.8390585Z T: int, 2025-05-07T20:32:15.8390892Z D: int, 2025-05-07T20:32:15.8391248Z scale_ub: Optional[float], 2025-05-07T20:32:15.8391695Z contiguous: bool, 2025-05-07T20:32:15.8392079Z compiled: bool, 2025-05-07T20:32:15.8392437Z ) -> None: 2025-05-07T20:32:15.8392782Z torch.manual_seed(2025) 2025-05-07T20:32:15.8393169Z 2025-05-07T20:32:15.8393604Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8397390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8400576Z 2025-05-07T20:32:15.8400888Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8401228Z 2025-05-07T20:32:15.8401385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8402033Z self=, 2025-05-07T20:32:15.8402665Z T=16384, 2025-05-07T20:32:15.8402967Z D=7168, 2025-05-07T20:32:15.8403240Z scale_ub=None, 2025-05-07T20:32:15.8403571Z contiguous=True, 2025-05-07T20:32:15.8403919Z compiled=False, 2025-05-07T20:32:15.8404224Z ) 2025-05-07T20:32:15.8404735Z self = 2025-05-07T20:32:15.8405585Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.8406070Z 2025-05-07T20:32:15.8406712Z @given( 2025-05-07T20:32:15.8407100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8407668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8408200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8408753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8409309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8409793Z ) 2025-05-07T20:32:15.8410376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8411157Z def test_silu_mul_quant( 2025-05-07T20:32:15.8411557Z self, 2025-05-07T20:32:15.8412123Z T: int, 2025-05-07T20:32:15.8412454Z D: int, 2025-05-07T20:32:15.8412805Z scale_ub: Optional[float], 2025-05-07T20:32:15.8413251Z contiguous: bool, 2025-05-07T20:32:15.8413633Z compiled: bool, 2025-05-07T20:32:15.8413995Z ) -> None: 2025-05-07T20:32:15.8414338Z torch.manual_seed(2025) 2025-05-07T20:32:15.8414863Z 2025-05-07T20:32:15.8415307Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8419073Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8422516Z 2025-05-07T20:32:15.8422728Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8423090Z 2025-05-07T20:32:15.8423266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8423962Z self=, 2025-05-07T20:32:15.8424663Z T=16384, 2025-05-07T20:32:15.8424968Z D=7168, 2025-05-07T20:32:15.8425263Z scale_ub=1200.0, 2025-05-07T20:32:15.8425618Z contiguous=True, 2025-05-07T20:32:15.8425985Z compiled=False, 2025-05-07T20:32:15.8426300Z ) 2025-05-07T20:32:15.8426808Z self = 2025-05-07T20:32:15.8427697Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.8428176Z 2025-05-07T20:32:15.8428300Z @given( 2025-05-07T20:32:15.8428674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.8429208Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.8429721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.8430286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.8430846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.8431332Z ) 2025-05-07T20:32:15.8431927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.8432699Z def test_silu_mul_quant( 2025-05-07T20:32:15.8433096Z self, 2025-05-07T20:32:15.8433399Z T: int, 2025-05-07T20:32:15.8433925Z D: int, 2025-05-07T20:32:15.8434299Z scale_ub: Optional[float], 2025-05-07T20:32:15.8434735Z contiguous: bool, 2025-05-07T20:32:15.8435139Z compiled: bool, 2025-05-07T20:32:15.8435516Z ) -> None: 2025-05-07T20:32:15.8435861Z torch.manual_seed(2025) 2025-05-07T20:32:15.8436263Z 2025-05-07T20:32:15.8436707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.8440543Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.8443988Z 2025-05-07T20:32:15.8444200Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.8444566Z 2025-05-07T20:32:15.8444733Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.8445436Z self=, 2025-05-07T20:32:15.8446128Z T=128, 2025-05-07T20:32:15.8446519Z D=5120, 2025-05-07T20:32:15.8446826Z scale_ub=1200.0, 2025-05-07T20:32:15.8447189Z contiguous=False, 2025-05-07T20:32:15.8447552Z compiled=False, 2025-05-07T20:32:15.8447888Z ) 2025-05-07T20:32:15.9681061Z self = 2025-05-07T20:32:15.9682511Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:15.9683838Z 2025-05-07T20:32:15.9684008Z @given( 2025-05-07T20:32:15.9684467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9685098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9685694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9686349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9687002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9687330Z ) 2025-05-07T20:32:15.9687678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9688136Z def test_silu_mul_quant( 2025-05-07T20:32:15.9688384Z self, 2025-05-07T20:32:15.9688577Z T: int, 2025-05-07T20:32:15.9688778Z D: int, 2025-05-07T20:32:15.9688999Z scale_ub: Optional[float], 2025-05-07T20:32:15.9689268Z contiguous: bool, 2025-05-07T20:32:15.9689514Z compiled: bool, 2025-05-07T20:32:15.9689753Z ) -> None: 2025-05-07T20:32:15.9689970Z torch.manual_seed(2025) 2025-05-07T20:32:15.9690219Z 2025-05-07T20:32:15.9690497Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9690846Z 2025-05-07T20:32:15.9691044Z x_sign = torch.sign(x) 2025-05-07T20:32:15.9691344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.9691660Z x = x_sign * x_clamp 2025-05-07T20:32:15.9691995Z x0 = x[:, :D] 2025-05-07T20:32:15.9692213Z x1 = x[:, D:] 2025-05-07T20:32:15.9692421Z 2025-05-07T20:32:15.9692607Z if contiguous: 2025-05-07T20:32:15.9692842Z x0 = x0.contiguous() 2025-05-07T20:32:15.9693101Z x1 = x1.contiguous() 2025-05-07T20:32:15.9693340Z 2025-05-07T20:32:15.9693529Z if scale_ub is not None: 2025-05-07T20:32:15.9693803Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.9694143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.9694453Z ) 2025-05-07T20:32:15.9694648Z else: 2025-05-07T20:32:15.9694863Z scale_ub_tensor = None 2025-05-07T20:32:15.9695111Z 2025-05-07T20:32:15.9695516Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.9695840Z op = silu_mul_quant 2025-05-07T20:32:15.9696088Z if compiled: 2025-05-07T20:32:15.9696337Z op = torch.compile(op) 2025-05-07T20:32:15.9696637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9696908Z 2025-05-07T20:32:15.9697099Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.9697276Z 2025-05-07T20:32:15.9697376Z moe/activation_test.py:117: 2025-05-07T20:32:15.9697680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9698013Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.9698298Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.9699013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.9699721Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.9700278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.9700983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.9701670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.9702215Z kernel = self.compile( 2025-05-07T20:32:15.9702775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.9703546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.9703948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.9704193Z 2025-05-07T20:32:15.9704452Z self = 2025-05-07T20:32:15.9705576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.9707432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7910cc0>} 2025-05-07T20:32:15.9708822Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.9709879Z context = 2025-05-07T20:32:15.9710183Z 2025-05-07T20:32:15.9710355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.9710900Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.9711381Z module_map=module_map) 2025-05-07T20:32:15.9711753Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.9712116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.9712381Z E ^ 2025-05-07T20:32:15.9712850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9713320Z 2025-05-07T20:32:15.9713750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9714288Z 2025-05-07T20:32:15.9714395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.9714829Z self=, 2025-05-07T20:32:15.9715253Z T=2048, 2025-05-07T20:32:15.9715510Z D=7168, 2025-05-07T20:32:15.9715804Z scale_ub=None, 2025-05-07T20:32:15.9716138Z contiguous=False, 2025-05-07T20:32:15.9716461Z compiled=False, 2025-05-07T20:32:15.9716756Z ) 2025-05-07T20:32:15.9717392Z self = 2025-05-07T20:32:15.9718076Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.9718505Z 2025-05-07T20:32:15.9718627Z @given( 2025-05-07T20:32:15.9718988Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.9719467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.9719948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.9720473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.9721022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.9721474Z ) 2025-05-07T20:32:15.9722032Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.9722798Z def test_silu_mul_quant( 2025-05-07T20:32:15.9723158Z self, 2025-05-07T20:32:15.9723417Z T: int, 2025-05-07T20:32:15.9723683Z D: int, 2025-05-07T20:32:15.9723975Z scale_ub: Optional[float], 2025-05-07T20:32:15.9724303Z contiguous: bool, 2025-05-07T20:32:15.9724548Z compiled: bool, 2025-05-07T20:32:15.9724773Z ) -> None: 2025-05-07T20:32:15.9724984Z torch.manual_seed(2025) 2025-05-07T20:32:15.9725231Z 2025-05-07T20:32:15.9725509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.9727641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:15.9729737Z 2025-05-07T20:32:15.9729872Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:15.9730089Z 2025-05-07T20:32:15.9730193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.9730619Z self=, 2025-05-07T20:32:15.9731033Z T=128, 2025-05-07T20:32:15.9731217Z D=7168, 2025-05-07T20:32:15.9731412Z scale_ub=1200.0, 2025-05-07T20:32:15.9731639Z contiguous=True, 2025-05-07T20:32:15.9731945Z compiled=True, 2025-05-07T20:32:15.9732149Z ) 2025-05-07T20:32:16.0035995Z self = 2025-05-07T20:32:16.0037159Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.0037530Z 2025-05-07T20:32:16.0037634Z @given( 2025-05-07T20:32:16.0037957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0038348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0038648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0038991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0039327Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0039611Z ) 2025-05-07T20:32:16.0039997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0040443Z def test_silu_mul_quant( 2025-05-07T20:32:16.0040692Z self, 2025-05-07T20:32:16.0040890Z T: int, 2025-05-07T20:32:16.0041089Z D: int, 2025-05-07T20:32:16.0041306Z scale_ub: Optional[float], 2025-05-07T20:32:16.0041580Z contiguous: bool, 2025-05-07T20:32:16.0041818Z compiled: bool, 2025-05-07T20:32:16.0042047Z ) -> None: 2025-05-07T20:32:16.0042265Z torch.manual_seed(2025) 2025-05-07T20:32:16.0042507Z 2025-05-07T20:32:16.0042787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0043138Z 2025-05-07T20:32:16.0043343Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0043883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0044205Z x = x_sign * x_clamp 2025-05-07T20:32:16.0044448Z x0 = x[:, :D] 2025-05-07T20:32:16.0044661Z x1 = x[:, D:] 2025-05-07T20:32:16.0044869Z 2025-05-07T20:32:16.0045056Z if contiguous: 2025-05-07T20:32:16.0045283Z x0 = x0.contiguous() 2025-05-07T20:32:16.0045544Z x1 = x1.contiguous() 2025-05-07T20:32:16.0045788Z 2025-05-07T20:32:16.0045978Z if scale_ub is not None: 2025-05-07T20:32:16.0046254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.0046597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.0046906Z ) 2025-05-07T20:32:16.0047098Z else: 2025-05-07T20:32:16.0047316Z scale_ub_tensor = None 2025-05-07T20:32:16.0047567Z 2025-05-07T20:32:16.0047805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0048124Z op = silu_mul_quant 2025-05-07T20:32:16.0048387Z if compiled: 2025-05-07T20:32:16.0048631Z op = torch.compile(op) 2025-05-07T20:32:16.0048931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0049209Z 2025-05-07T20:32:16.0049396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.0049567Z 2025-05-07T20:32:16.0049667Z moe/activation_test.py:117: 2025-05-07T20:32:16.0049969Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0050386Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.0050671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0051250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.0051909Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.0052663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.0053376Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.0053935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.0054632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.0055319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.0055872Z kernel = self.compile( 2025-05-07T20:32:16.0056428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.0057108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.0057562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0057802Z 2025-05-07T20:32:16.0058023Z self = 2025-05-07T20:32:16.0059145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.0060572Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7911a80>} 2025-05-07T20:32:16.0061959Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.0063019Z context = 2025-05-07T20:32:16.0063314Z 2025-05-07T20:32:16.0063498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.0064028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.0064597Z module_map=module_map) 2025-05-07T20:32:16.0064977Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.0065339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.0065598Z E ^ 2025-05-07T20:32:16.0066075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.0066537Z 2025-05-07T20:32:16.0066973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.0067505Z 2025-05-07T20:32:16.0067608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0068034Z self=, 2025-05-07T20:32:16.0068453Z T=128, 2025-05-07T20:32:16.0068644Z D=7168, 2025-05-07T20:32:16.0068832Z scale_ub=1200.0, 2025-05-07T20:32:16.0069061Z contiguous=True, 2025-05-07T20:32:16.0069286Z compiled=False, 2025-05-07T20:32:16.0069489Z ) 2025-05-07T20:32:16.0069821Z self = 2025-05-07T20:32:16.0070329Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.0070606Z 2025-05-07T20:32:16.0070684Z @given( 2025-05-07T20:32:16.0070918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0071236Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0071628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0071965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0072298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0072588Z ) 2025-05-07T20:32:16.0072939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0073434Z def test_silu_mul_quant( 2025-05-07T20:32:16.0073678Z self, 2025-05-07T20:32:16.0073868Z T: int, 2025-05-07T20:32:16.0074067Z D: int, 2025-05-07T20:32:16.0074291Z scale_ub: Optional[float], 2025-05-07T20:32:16.0074564Z contiguous: bool, 2025-05-07T20:32:16.0074807Z compiled: bool, 2025-05-07T20:32:16.0075036Z ) -> None: 2025-05-07T20:32:16.0075247Z torch.manual_seed(2025) 2025-05-07T20:32:16.0075491Z 2025-05-07T20:32:16.0075766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0076114Z 2025-05-07T20:32:16.0076311Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0076603Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0078746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.0080681Z 2025-05-07T20:32:16.0080807Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.0081022Z 2025-05-07T20:32:16.0081124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0081548Z self=, 2025-05-07T20:32:16.0081966Z T=128, 2025-05-07T20:32:16.0082150Z D=5120, 2025-05-07T20:32:16.0082346Z scale_ub=1200.0, 2025-05-07T20:32:16.0082573Z contiguous=True, 2025-05-07T20:32:16.0082793Z compiled=True, 2025-05-07T20:32:16.0083000Z ) 2025-05-07T20:32:16.0083326Z self = 2025-05-07T20:32:16.0083834Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.0084108Z 2025-05-07T20:32:16.0084185Z @given( 2025-05-07T20:32:16.0084509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0084828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0085131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0085464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0085797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0086080Z ) 2025-05-07T20:32:16.0086436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0086891Z def test_silu_mul_quant( 2025-05-07T20:32:16.0087140Z self, 2025-05-07T20:32:16.0087355Z T: int, 2025-05-07T20:32:16.0087576Z D: int, 2025-05-07T20:32:16.0087795Z scale_ub: Optional[float], 2025-05-07T20:32:16.0088066Z contiguous: bool, 2025-05-07T20:32:16.0088309Z compiled: bool, 2025-05-07T20:32:16.0088534Z ) -> None: 2025-05-07T20:32:16.0088750Z torch.manual_seed(2025) 2025-05-07T20:32:16.0088992Z 2025-05-07T20:32:16.0089295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0089645Z 2025-05-07T20:32:16.0089835Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0090129Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0101385Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
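By this point even a 20 MiB allocation inside torch.clamp fails: Hypothesis replays the test body many times in one process, and tensors held by earlier failing examples accumulate. One mitigation is an explicit per-example cleanup at the top of the test body (a hypothetical helper, not present in the test file):

```python
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()               # drop dead Python references first
    torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
    torch.cuda.synchronize()
```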
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.0103498Z 2025-05-07T20:32:16.0103623Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.0103854Z 2025-05-07T20:32:16.0103969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0104399Z self=, 2025-05-07T20:32:16.0104815Z T=128, 2025-05-07T20:32:16.0105005Z D=7168, 2025-05-07T20:32:16.0105203Z scale_ub=None, 2025-05-07T20:32:16.0105422Z contiguous=True, 2025-05-07T20:32:16.0105643Z compiled=True, 2025-05-07T20:32:16.0105855Z ) 2025-05-07T20:32:16.4968068Z self = 2025-05-07T20:32:16.4968788Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.4969070Z 2025-05-07T20:32:16.4969152Z @given( 2025-05-07T20:32:16.4969399Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4969736Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4970047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4970388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4970730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4971023Z ) 2025-05-07T20:32:16.4971383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4971935Z def test_silu_mul_quant( 2025-05-07T20:32:16.4972183Z self, 2025-05-07T20:32:16.4972388Z T: int, 2025-05-07T20:32:16.4972595Z D: int, 2025-05-07T20:32:16.4972821Z scale_ub: Optional[float], 2025-05-07T20:32:16.4973103Z contiguous: bool, 2025-05-07T20:32:16.4973352Z compiled: bool, 2025-05-07T20:32:16.4973583Z ) -> None: 2025-05-07T20:32:16.4973808Z torch.manual_seed(2025) 2025-05-07T20:32:16.4974059Z 2025-05-07T20:32:16.4974337Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4976809Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.4978833Z 2025-05-07T20:32:16.4978956Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.4979182Z 2025-05-07T20:32:16.5019856Z FAILED 2025-05-07T20:32:16.5020033Z 2025-05-07T20:32:16.5020514Z =================================== FAILURES =================================== 2025-05-07T20:32:16.5021152Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:16.5021779Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:16.5022654Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:16.5023425Z | yield 2025-05-07T20:32:16.5024022Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:16.5024748Z | self._callTestMethod(testMethod) 2025-05-07T20:32:16.5025530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:16.5026603Z | if method() is not None: 2025-05-07T20:32:16.5026953Z | ^^^^^^^^ 2025-05-07T20:32:16.5027856Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:16.5028883Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5029397Z | ^^^^^^^ 2025-05-07T20:32:16.5030190Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:16.5031101Z | raise the_error_hypothesis_found 2025-05-07T20:32:16.5031698Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:16.5032282Z +-+---------------- 1 ---------------- 2025-05-07T20:32:16.5032687Z | Traceback (most recent call last): 2025-05-07T20:32:16.5033692Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.5034802Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5035320Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5038167Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5040990Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5041607Z | self=, 2025-05-07T20:32:16.5042176Z | T=2048, 2025-05-07T20:32:16.5042499Z | D=5120, # or any other generated value 2025-05-07T20:32:16.5042975Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:16.5043485Z | contiguous=True, # or any other generated value 2025-05-07T20:32:16.5043994Z | compiled=False, # or any other generated value 2025-05-07T20:32:16.5044423Z | ) 2025-05-07T20:32:16.5044672Z | 2025-05-07T20:32:16.5045557Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:16.5046445Z +---------------- 2 ---------------- 2025-05-07T20:32:16.5046849Z | Traceback (most recent call last): 2025-05-07T20:32:16.5047844Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.5048929Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5049451Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5052259Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5055144Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5055755Z | self=, 2025-05-07T20:32:16.5057057Z | T=128, 2025-05-07T20:32:16.5057339Z | D=7168, 2025-05-07T20:32:16.5057634Z | scale_ub=None, 2025-05-07T20:32:16.5057966Z | contiguous=True, 2025-05-07T20:32:16.5058241Z | compiled=True, 2025-05-07T20:32:16.5058481Z | ) 2025-05-07T20:32:16.5058667Z | 2025-05-07T20:32:16.5059214Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:16.5059910Z +---------------- 3 ---------------- 2025-05-07T20:32:16.5060208Z | Traceback (most recent call last): 2025-05-07T20:32:16.5060947Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.5061755Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5062144Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5064516Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5066582Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5067037Z | self=, 2025-05-07T20:32:16.5067462Z | T=128, 2025-05-07T20:32:16.5067668Z | D=5120, 2025-05-07T20:32:16.5067882Z | scale_ub=1200.0, 2025-05-07T20:32:16.5068132Z | contiguous=True, 2025-05-07T20:32:16.5068384Z | compiled=True, 2025-05-07T20:32:16.5068617Z | ) 2025-05-07T20:32:16.5068802Z | 2025-05-07T20:32:16.5069343Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:16.5069964Z +---------------- 4 ---------------- 2025-05-07T20:32:16.5070272Z | Traceback (most recent call last): 2025-05-07T20:32:16.5071014Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:16.5071851Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5072145Z | ^^^^^^^^ 2025-05-07T20:32:16.5072807Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:16.5073645Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5074118Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5075242Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:16.5076363Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5077230Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:16.5078279Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5078906Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5079816Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:16.5080928Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5081648Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5082582Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:16.5083582Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5084154Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5085004Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:16.5085819Z | fn() 2025-05-07T20:32:16.5086631Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:16.5087543Z | self.fn.run( 2025-05-07T20:32:16.5088303Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:16.5089138Z | kernel = self.compile( 2025-05-07T20:32:16.5089500Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:16.5090349Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:16.5091359Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5092018Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5092938Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:16.5094075Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5094758Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.5095285Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5095780Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5096151Z | ^ 2025-05-07T20:32:16.5096800Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5097618Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:16.5098178Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:16.5098910Z | self=, 2025-05-07T20:32:16.5099686Z | T=1, # or any other generated value 2025-05-07T20:32:16.5100110Z | D=5120, # or any other generated value 2025-05-07T20:32:16.5100576Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:16.5101074Z | contiguous=True, # or any other generated value 2025-05-07T20:32:16.5101570Z | compiled=True, # or any other generated value 2025-05-07T20:32:16.5101987Z | ) 2025-05-07T20:32:16.5102238Z | 2025-05-07T20:32:16.5102964Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:16.5103820Z +------------------------------------ 2025-05-07T20:32:16.5104322Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:16.5104859Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5105428Z self=, 2025-05-07T20:32:16.5121799Z T=1, 2025-05-07T20:32:16.5122090Z D=5120, 2025-05-07T20:32:16.5122353Z scale_ub=None, 2025-05-07T20:32:16.5122656Z contiguous=True, 2025-05-07T20:32:16.5122967Z compiled=True, 2025-05-07T20:32:16.5123247Z ) 2025-05-07T20:32:16.5123700Z self = 2025-05-07T20:32:16.5124398Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5124956Z 2025-05-07T20:32:16.5125064Z @given( 2025-05-07T20:32:16.5125377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5125803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5126225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5126662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5127191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5127572Z ) 2025-05-07T20:32:16.5128028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5128643Z def test_silu_mul_quant( 2025-05-07T20:32:16.5128992Z self, 2025-05-07T20:32:16.5129241Z T: int, 2025-05-07T20:32:16.5129502Z D: int, 2025-05-07T20:32:16.5129796Z scale_ub: Optional[float], 2025-05-07T20:32:16.5130172Z contiguous: bool, 2025-05-07T20:32:16.5130507Z compiled: bool, 2025-05-07T20:32:16.5130807Z ) -> None: 2025-05-07T20:32:16.5131097Z torch.manual_seed(2025) 2025-05-07T20:32:16.5131435Z 2025-05-07T20:32:16.5131921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5132394Z 2025-05-07T20:32:16.5132644Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5133014Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5133449Z x = x_sign * x_clamp 2025-05-07T20:32:16.5133782Z x0 = x[:, :D] 2025-05-07T20:32:16.5134082Z x1 = x[:, D:] 2025-05-07T20:32:16.5134353Z 2025-05-07T20:32:16.5134596Z if contiguous: 2025-05-07T20:32:16.5134916Z x0 = x0.contiguous() 2025-05-07T20:32:16.5135262Z x1 = x1.contiguous() 2025-05-07T20:32:16.5135572Z 2025-05-07T20:32:16.5135828Z if scale_ub is not None: 2025-05-07T20:32:16.5136338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5136792Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5137263Z ) 2025-05-07T20:32:16.5137515Z else: 2025-05-07T20:32:16.5137784Z scale_ub_tensor = None 2025-05-07T20:32:16.5138117Z 2025-05-07T20:32:16.5138424Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5138830Z op = silu_mul_quant 2025-05-07T20:32:16.5139155Z if compiled: 2025-05-07T20:32:16.5139491Z op = torch.compile(op) 2025-05-07T20:32:16.5139876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5140224Z 2025-05-07T20:32:16.5140469Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5140998Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5141376Z 2025-05-07T20:32:16.5141681Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5142113Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5142484Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5142889Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5143353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5143754Z 2025-05-07T20:32:16.5144017Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5144281Z 2025-05-07T20:32:16.5144415Z moe/activation_test.py:126: 2025-05-07T20:32:16.5144801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5145236Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5145660Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5146740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5147767Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5148497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5149418Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5150412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5151388Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5152368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5153287Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5154136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5154858Z fn() 2025-05-07T20:32:16.5155576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5156400Z self.fn.run( 2025-05-07T20:32:16.5157049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5157797Z kernel = self.compile( 2025-05-07T20:32:16.5158552Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5159467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5159995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5160312Z 2025-05-07T20:32:16.5160578Z self = 2025-05-07T20:32:16.5162039Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5163931Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4960c20>} 2025-05-07T20:32:16.5165734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5167115Z context = 2025-05-07T20:32:16.5167504Z 2025-05-07T20:32:16.5167719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5168424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5169145Z module_map=module_map) 2025-05-07T20:32:16.5169631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5170127Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5170494Z E ^ 2025-05-07T20:32:16.5171154Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5171936Z 2025-05-07T20:32:16.5172541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5173292Z 2025-05-07T20:32:16.5173444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5174010Z self=, 2025-05-07T20:32:16.5174562Z T=2048, 2025-05-07T20:32:16.5174807Z D=5120, 2025-05-07T20:32:16.5175059Z scale_ub=1200.0, 2025-05-07T20:32:16.5175350Z contiguous=True, 2025-05-07T20:32:16.5175648Z compiled=False, 2025-05-07T20:32:16.5175920Z ) 2025-05-07T20:32:16.5176356Z self = 2025-05-07T20:32:16.5177029Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.5177406Z 2025-05-07T20:32:16.5177513Z @given( 2025-05-07T20:32:16.5177814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5178313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5178730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5179180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5179631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5180024Z ) 2025-05-07T20:32:16.5180546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5181170Z def test_silu_mul_quant( 2025-05-07T20:32:16.5181494Z self, 2025-05-07T20:32:16.5181747Z T: int, 2025-05-07T20:32:16.5182015Z D: int, 2025-05-07T20:32:16.5182297Z scale_ub: Optional[float], 2025-05-07T20:32:16.5182656Z contiguous: bool, 2025-05-07T20:32:16.5182962Z compiled: bool, 2025-05-07T20:32:16.5183265Z ) -> None: 2025-05-07T20:32:16.5183547Z torch.manual_seed(2025) 2025-05-07T20:32:16.5183865Z 2025-05-07T20:32:16.5184218Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5184674Z 2025-05-07T20:32:16.5184918Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5185298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5185705Z x = x_sign * x_clamp 2025-05-07T20:32:16.5186012Z x0 = x[:, :D] 
2025-05-07T20:32:16.5186293Z x1 = x[:, D:] 2025-05-07T20:32:16.5186573Z 2025-05-07T20:32:16.5186812Z if contiguous: 2025-05-07T20:32:16.5187109Z x0 = x0.contiguous() 2025-05-07T20:32:16.5187437Z x1 = x1.contiguous() 2025-05-07T20:32:16.5187742Z 2025-05-07T20:32:16.5187982Z if scale_ub is not None: 2025-05-07T20:32:16.5188339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5188768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5189165Z ) 2025-05-07T20:32:16.5189426Z else: 2025-05-07T20:32:16.5189700Z scale_ub_tensor = None 2025-05-07T20:32:16.5190023Z 2025-05-07T20:32:16.5190342Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5190777Z op = silu_mul_quant 2025-05-07T20:32:16.5191116Z if compiled: 2025-05-07T20:32:16.5191457Z op = torch.compile(op) 2025-05-07T20:32:16.5191869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5192248Z 2025-05-07T20:32:16.5192512Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5192740Z 2025-05-07T20:32:16.5192883Z moe/activation_test.py:117: 2025-05-07T20:32:16.5193390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5193855Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5194249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5195234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5196223Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5196998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5198026Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5198934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5199643Z kernel = self.compile( 2025-05-07T20:32:16.5200367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5201251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5222853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5223205Z 2025-05-07T20:32:16.5223488Z self = 2025-05-07T20:32:16.5225016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5227301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d4820180>} 2025-05-07T20:32:16.5229335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5230812Z context = 2025-05-07T20:32:16.5231222Z 2025-05-07T20:32:16.5231453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5232195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5232845Z module_map=module_map) 2025-05-07T20:32:16.5233326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5233790Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5234133Z E ^ 2025-05-07T20:32:16.5234744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5235381Z 2025-05-07T20:32:16.5235963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5236677Z 2025-05-07T20:32:16.5236824Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5237389Z self=, 2025-05-07T20:32:16.5237945Z T=2048, 2025-05-07T20:32:16.5238196Z D=5120, 2025-05-07T20:32:16.5238451Z scale_ub=1200.0, 2025-05-07T20:32:16.5238744Z contiguous=True, 2025-05-07T20:32:16.5239047Z compiled=True, 2025-05-07T20:32:16.5239324Z ) 2025-05-07T20:32:16.5239753Z self = 2025-05-07T20:32:16.5240431Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.5240808Z 2025-05-07T20:32:16.5240915Z @given( 2025-05-07T20:32:16.5241223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5241642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5242053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5242498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5243151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5243562Z ) 2025-05-07T20:32:16.5244059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5244684Z def test_silu_mul_quant( 2025-05-07T20:32:16.5245016Z self, 2025-05-07T20:32:16.5245274Z T: int, 2025-05-07T20:32:16.5245552Z D: int, 2025-05-07T20:32:16.5245812Z scale_ub: Optional[float], 2025-05-07T20:32:16.5246092Z contiguous: bool, 2025-05-07T20:32:16.5246346Z compiled: bool, 2025-05-07T20:32:16.5246572Z ) -> None: 2025-05-07T20:32:16.5246799Z torch.manual_seed(2025) 2025-05-07T20:32:16.5247046Z 2025-05-07T20:32:16.5247319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5247668Z 2025-05-07T20:32:16.5247871Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5248163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5248478Z x = x_sign * x_clamp 2025-05-07T20:32:16.5248730Z x0 = x[:, :D] 2025-05-07T20:32:16.5248947Z x1 = x[:, D:] 2025-05-07T20:32:16.5249161Z 2025-05-07T20:32:16.5249350Z if contiguous: 2025-05-07T20:32:16.5249585Z x0 = x0.contiguous() 2025-05-07T20:32:16.5249843Z x1 = x1.contiguous() 2025-05-07T20:32:16.5250087Z 2025-05-07T20:32:16.5250279Z if scale_ub is not None: 2025-05-07T20:32:16.5250621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5250966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5251288Z ) 2025-05-07T20:32:16.5251481Z else: 2025-05-07T20:32:16.5251695Z scale_ub_tensor = None 2025-05-07T20:32:16.5252094Z 2025-05-07T20:32:16.5252385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5252710Z op = silu_mul_quant 2025-05-07T20:32:16.5252964Z if compiled: 2025-05-07T20:32:16.5253210Z op = torch.compile(op) 2025-05-07T20:32:16.5253514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5253797Z 2025-05-07T20:32:16.5253988Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5254280Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5254574Z 2025-05-07T20:32:16.5254816Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5255158Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5255461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5255786Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5256148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5256464Z 2025-05-07T20:32:16.5256673Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5256870Z 2025-05-07T20:32:16.5256975Z moe/activation_test.py:126: 2025-05-07T20:32:16.5257284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5257633Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5257970Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5258780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5259562Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5260124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5260823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5261534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5262282Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5263118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5263772Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5264387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5264916Z fn() 2025-05-07T20:32:16.5265435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5266029Z self.fn.run( 2025-05-07T20:32:16.5266506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5267052Z kernel = self.compile( 2025-05-07T20:32:16.5267600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5268276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5268692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5268926Z 2025-05-07T20:32:16.5269143Z self = 2025-05-07T20:32:16.5270253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5271728Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d45eaa20>} 2025-05-07T20:32:16.5273117Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5274218Z context = 2025-05-07T20:32:16.5274514Z 2025-05-07T20:32:16.5274687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5275225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5275702Z module_map=module_map) 2025-05-07T20:32:16.5276073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5276434Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5276713Z E ^ 2025-05-07T20:32:16.5277193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5277657Z 2025-05-07T20:32:16.5278090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5278623Z 2025-05-07T20:32:16.5278728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5279152Z self=, 2025-05-07T20:32:16.5279574Z T=16384, 2025-05-07T20:32:16.5279762Z D=7168, 2025-05-07T20:32:16.5279956Z scale_ub=1200.0, 2025-05-07T20:32:16.5280183Z contiguous=False, 2025-05-07T20:32:16.5280408Z compiled=False, 2025-05-07T20:32:16.5280612Z ) 2025-05-07T20:32:16.5280936Z self = 2025-05-07T20:32:16.5281446Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.5281745Z 2025-05-07T20:32:16.5281824Z @given( 2025-05-07T20:32:16.5282061Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5282382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5282689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5283029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5283364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5283650Z ) 2025-05-07T20:32:16.5284095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5284552Z def test_silu_mul_quant( 2025-05-07T20:32:16.5284789Z self, 2025-05-07T20:32:16.5284987Z T: int, 2025-05-07T20:32:16.5285188Z D: int, 2025-05-07T20:32:16.5285404Z scale_ub: Optional[float], 2025-05-07T20:32:16.5285680Z contiguous: bool, 2025-05-07T20:32:16.5285923Z compiled: bool, 2025-05-07T20:32:16.5286153Z ) -> None: 2025-05-07T20:32:16.5286366Z torch.manual_seed(2025) 2025-05-07T20:32:16.5286606Z 2025-05-07T20:32:16.5286883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5287222Z 2025-05-07T20:32:16.5287416Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5287712Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5288018Z x = x_sign * x_clamp 2025-05-07T20:32:16.5288262Z x0 = x[:, :D] 2025-05-07T20:32:16.5288477Z x1 = x[:, D:] 2025-05-07T20:32:16.5288678Z 2025-05-07T20:32:16.5288869Z if contiguous: 2025-05-07T20:32:16.5289101Z x0 = x0.contiguous() 2025-05-07T20:32:16.5289356Z x1 = x1.contiguous() 2025-05-07T20:32:16.5289597Z 2025-05-07T20:32:16.5289788Z if scale_ub is not None: 2025-05-07T20:32:16.5290055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5290394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5290760Z ) 2025-05-07T20:32:16.5290955Z else: 2025-05-07T20:32:16.5291164Z scale_ub_tensor = None 2025-05-07T20:32:16.5291416Z 2025-05-07T20:32:16.5291649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5292069Z op = silu_mul_quant 2025-05-07T20:32:16.5292374Z if compiled: 2025-05-07T20:32:16.5292626Z op = torch.compile(op) 2025-05-07T20:32:16.5292930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5293211Z 2025-05-07T20:32:16.5293418Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5293585Z 2025-05-07T20:32:16.5293686Z moe/activation_test.py:117: 2025-05-07T20:32:16.5293988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5294327Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5294609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5295317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.5296031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5296582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5297285Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5297976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5298537Z kernel = self.compile( 2025-05-07T20:32:16.5299095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5299766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5300176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5300413Z 2025-05-07T20:32:16.5300634Z self = 2025-05-07T20:32:16.5301756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5303180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cf06b6a0>} 2025-05-07T20:32:16.5304694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5305761Z context = 2025-05-07T20:32:16.5306057Z 2025-05-07T20:32:16.5306584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5307128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5307602Z module_map=module_map) 2025-05-07T20:32:16.5307974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5308331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5308605Z E ^ 2025-05-07T20:32:16.5309081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5309552Z 2025-05-07T20:32:16.5309988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5310519Z 2025-05-07T20:32:16.5310628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5311047Z self=, 2025-05-07T20:32:16.5311462Z T=1, 2025-05-07T20:32:16.5311654Z D=7168, 2025-05-07T20:32:16.5311957Z scale_ub=None, 2025-05-07T20:32:16.5312176Z contiguous=True, 2025-05-07T20:32:16.5312404Z compiled=True, 2025-05-07T20:32:16.5312600Z ) 2025-05-07T20:32:16.5312929Z self = 2025-05-07T20:32:16.5313427Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5313760Z 2025-05-07T20:32:16.5313843Z @given( 2025-05-07T20:32:16.5314073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5314394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5314713Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5315041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5315376Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5315669Z ) 2025-05-07T20:32:16.5316017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5316468Z def test_silu_mul_quant( 2025-05-07T20:32:16.5316715Z self, 2025-05-07T20:32:16.5316909Z T: int, 2025-05-07T20:32:16.5317102Z D: int, 2025-05-07T20:32:16.5317322Z scale_ub: Optional[float], 2025-05-07T20:32:16.5317594Z contiguous: bool, 2025-05-07T20:32:16.5317834Z compiled: bool, 2025-05-07T20:32:16.5318059Z ) -> None: 2025-05-07T20:32:16.5318275Z torch.manual_seed(2025) 2025-05-07T20:32:16.5318513Z 2025-05-07T20:32:16.5318793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5319139Z 2025-05-07T20:32:16.5319337Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5319632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5319946Z x = x_sign * x_clamp 2025-05-07T20:32:16.5320182Z x0 = x[:, :D] 2025-05-07T20:32:16.5320399Z x1 = x[:, D:] 2025-05-07T20:32:16.5320608Z 2025-05-07T20:32:16.5320786Z if contiguous: 2025-05-07T20:32:16.5321020Z x0 = x0.contiguous() 2025-05-07T20:32:16.5321280Z x1 = x1.contiguous() 2025-05-07T20:32:16.5321518Z 2025-05-07T20:32:16.5321712Z if scale_ub is not None: 2025-05-07T20:32:16.5321991Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5322330Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5322642Z ) 2025-05-07T20:32:16.5322842Z else: 2025-05-07T20:32:16.5323056Z scale_ub_tensor = None 2025-05-07T20:32:16.5323306Z 2025-05-07T20:32:16.5323675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5324000Z op = silu_mul_quant 2025-05-07T20:32:16.5324252Z if compiled: 2025-05-07T20:32:16.5324504Z op = torch.compile(op) 2025-05-07T20:32:16.5324805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5325084Z 2025-05-07T20:32:16.5325283Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5325573Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5325865Z 2025-05-07T20:32:16.5326107Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5326448Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5326750Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5327068Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5327437Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5327756Z 2025-05-07T20:32:16.5327954Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5328158Z 2025-05-07T20:32:16.5328263Z moe/activation_test.py:126: 2025-05-07T20:32:16.5328566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5328906Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5329237Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5330047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5330877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5331436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5332361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5333432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5334520Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5335591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5336524Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5337389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5338156Z fn() 2025-05-07T20:32:16.5338897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5339749Z self.fn.run( 2025-05-07T20:32:16.5340445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5341239Z kernel = self.compile( 2025-05-07T20:32:16.5342061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5343074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5343661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5344001Z 2025-05-07T20:32:16.5344298Z self = 2025-05-07T20:32:16.5345909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5348042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec65620>} 2025-05-07T20:32:16.5350268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5351902Z context = 2025-05-07T20:32:16.5352378Z 2025-05-07T20:32:16.5352645Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5353484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5354238Z module_map=module_map) 2025-05-07T20:32:16.5354797Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5355365Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5355768Z E ^ 2025-05-07T20:32:16.5356440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5357182Z 2025-05-07T20:32:16.5357925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5358856Z 2025-05-07T20:32:16.5359028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5359717Z self=, 2025-05-07T20:32:16.5360391Z T=4096, 2025-05-07T20:32:16.5360687Z D=5120, 2025-05-07T20:32:16.5360987Z scale_ub=None, 2025-05-07T20:32:16.5361315Z contiguous=False, 2025-05-07T20:32:16.5361666Z compiled=False, 2025-05-07T20:32:16.5362052Z ) 2025-05-07T20:32:16.5362511Z self = 2025-05-07T20:32:16.5363307Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.5363756Z 2025-05-07T20:32:16.5363883Z @given( 2025-05-07T20:32:16.5364231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5364789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5365279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5365811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5366344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5366799Z ) 2025-05-07T20:32:16.5367413Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5368149Z def test_silu_mul_quant( 2025-05-07T20:32:16.5368552Z self, 2025-05-07T20:32:16.5368848Z T: int, 2025-05-07T20:32:16.5369123Z D: int, 2025-05-07T20:32:16.5369435Z scale_ub: Optional[float], 2025-05-07T20:32:16.5369845Z contiguous: bool, 2025-05-07T20:32:16.5370229Z compiled: bool, 2025-05-07T20:32:16.5370587Z ) -> None: 2025-05-07T20:32:16.5370916Z torch.manual_seed(2025) 2025-05-07T20:32:16.5371279Z 2025-05-07T20:32:16.5371723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5372370Z 2025-05-07T20:32:16.5372675Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5373129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5373637Z x = x_sign * x_clamp 2025-05-07T20:32:16.5373998Z x0 = x[:, :D] 2025-05-07T20:32:16.5374328Z x1 = x[:, D:] 2025-05-07T20:32:16.5374651Z 2025-05-07T20:32:16.5374926Z if contiguous: 2025-05-07T20:32:16.5375277Z x0 = x0.contiguous() 2025-05-07T20:32:16.5375671Z x1 = x1.contiguous() 2025-05-07T20:32:16.5376044Z 2025-05-07T20:32:16.5376322Z if scale_ub is not None: 2025-05-07T20:32:16.5376733Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5377261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5377743Z ) 2025-05-07T20:32:16.5378050Z else: 2025-05-07T20:32:16.5378382Z scale_ub_tensor = None 2025-05-07T20:32:16.5378784Z 2025-05-07T20:32:16.5379160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5379673Z op = silu_mul_quant 2025-05-07T20:32:16.5380082Z if compiled: 2025-05-07T20:32:16.5380636Z op = torch.compile(op) 2025-05-07T20:32:16.5381126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5381569Z 2025-05-07T20:32:16.5381860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5382145Z 2025-05-07T20:32:16.5382302Z moe/activation_test.py:117: 2025-05-07T20:32:16.5382796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5383363Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5383854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5385012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5386182Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5387151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5388355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5389603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5390598Z kernel = self.compile( 2025-05-07T20:32:16.5391558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5392721Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5393497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5393898Z 2025-05-07T20:32:16.5394228Z self = 2025-05-07T20:32:16.5395910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5398184Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec665c0>} 2025-05-07T20:32:16.5400267Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5401972Z context = 2025-05-07T20:32:16.5402400Z 2025-05-07T20:32:16.5402633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5403388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5404088Z module_map=module_map) 2025-05-07T20:32:16.5404568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5405051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5405411Z E ^ 2025-05-07T20:32:16.5406108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5406985Z 2025-05-07T20:32:16.5407674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5408479Z 2025-05-07T20:32:16.5408625Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5409233Z self=, 2025-05-07T20:32:16.5409822Z T=4096, 2025-05-07T20:32:16.5410089Z D=7168, 2025-05-07T20:32:16.5410364Z scale_ub=None, 2025-05-07T20:32:16.5410678Z contiguous=False, 2025-05-07T20:32:16.5411011Z compiled=False, 2025-05-07T20:32:16.5411316Z ) 2025-05-07T20:32:16.5411888Z self = 2025-05-07T20:32:16.5423298Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.5423754Z 2025-05-07T20:32:16.5424112Z @given( 2025-05-07T20:32:16.5424449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5424900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5425325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5425796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5426265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5426667Z ) 2025-05-07T20:32:16.5427166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5427810Z def test_silu_mul_quant( 2025-05-07T20:32:16.5428148Z self, 2025-05-07T20:32:16.5428422Z T: int, 2025-05-07T20:32:16.5428698Z D: int, 2025-05-07T20:32:16.5428998Z scale_ub: Optional[float], 2025-05-07T20:32:16.5429385Z contiguous: bool, 2025-05-07T20:32:16.5429725Z compiled: bool, 2025-05-07T20:32:16.5430042Z ) -> None: 2025-05-07T20:32:16.5430345Z torch.manual_seed(2025) 2025-05-07T20:32:16.5430691Z 2025-05-07T20:32:16.5431074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5431557Z 2025-05-07T20:32:16.5431831Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5432236Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5432679Z x = x_sign * x_clamp 2025-05-07T20:32:16.5433119Z x0 = x[:, :D] 2025-05-07T20:32:16.5433420Z x1 = x[:, D:] 2025-05-07T20:32:16.5433707Z 2025-05-07T20:32:16.5433956Z if contiguous: 2025-05-07T20:32:16.5434276Z x0 = x0.contiguous() 2025-05-07T20:32:16.5434638Z x1 = x1.contiguous() 2025-05-07T20:32:16.5434967Z 2025-05-07T20:32:16.5435231Z if scale_ub is not None: 2025-05-07T20:32:16.5435710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5436175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5436613Z ) 2025-05-07T20:32:16.5436887Z else: 2025-05-07T20:32:16.5437169Z scale_ub_tensor = None 2025-05-07T20:32:16.5437525Z 2025-05-07T20:32:16.5437843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5438279Z op = silu_mul_quant 2025-05-07T20:32:16.5438629Z if compiled: 2025-05-07T20:32:16.5438980Z op = torch.compile(op) 2025-05-07T20:32:16.5439388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5439777Z 2025-05-07T20:32:16.5440041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5440272Z 2025-05-07T20:32:16.5440415Z moe/activation_test.py:117: 2025-05-07T20:32:16.5440826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5441308Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5441700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5442701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5443701Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5444483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5445475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5446435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5447209Z kernel = self.compile( 2025-05-07T20:32:16.5447987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5448932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5449506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5449850Z 2025-05-07T20:32:16.5450138Z self = 2025-05-07T20:32:16.5451950Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5453997Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cec676a0>} 2025-05-07T20:32:16.5455959Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5457505Z context = 2025-05-07T20:32:16.5457922Z 2025-05-07T20:32:16.5458161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5458918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5459579Z module_map=module_map) 2025-05-07T20:32:16.5460082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5460577Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5460933Z E ^ 2025-05-07T20:32:16.5461600Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5462321Z 
2025-05-07T20:32:16.5462931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.5463679Z 
2025-05-07T20:32:16.5463831Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.5464472Z     self=,
2025-05-07T20:32:16.5465048Z     T=128,
2025-05-07T20:32:16.5465311Z     D=7168,
2025-05-07T20:32:16.5465571Z     scale_ub=None,
2025-05-07T20:32:16.5465876Z     contiguous=False,
2025-05-07T20:32:16.5466189Z     compiled=True,
2025-05-07T20:32:16.5466463Z )
2025-05-07T20:32:16.5466911Z self = 
2025-05-07T20:32:16.5467612Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:16.5467994Z 
2025-05-07T20:32:16.5468108Z     @given(
2025-05-07T20:32:16.5468419Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:16.5468865Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:16.5469300Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:16.5469761Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:16.5470232Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:16.5470642Z     )
2025-05-07T20:32:16.5471140Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:16.5471779Z     def test_silu_mul_quant(
2025-05-07T20:32:16.5472115Z         self,
2025-05-07T20:32:16.5472385Z         T: int,
2025-05-07T20:32:16.5472662Z         D: int,
2025-05-07T20:32:16.5472969Z         scale_ub: Optional[float],
2025-05-07T20:32:16.5473351Z         contiguous: bool,
2025-05-07T20:32:16.5473679Z         compiled: bool,
2025-05-07T20:32:16.5473993Z     ) -> None:
2025-05-07T20:32:16.5474291Z         torch.manual_seed(2025)
2025-05-07T20:32:16.5474624Z 
2025-05-07T20:32:16.5475005Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.5475489Z 
2025-05-07T20:32:16.5475748Z         x_sign = torch.sign(x)
2025-05-07T20:32:16.5476154Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.5476599Z         x = x_sign * x_clamp
2025-05-07T20:32:16.5476931Z         x0 = x[:, :D]
2025-05-07T20:32:16.5477268Z         x1 = x[:, D:]
2025-05-07T20:32:16.5477583Z 
2025-05-07T20:32:16.5477830Z         if contiguous:
2025-05-07T20:32:16.5478156Z             x0 = x0.contiguous()
2025-05-07T20:32:16.5478621Z             x1 = x1.contiguous()
2025-05-07T20:32:16.5478955Z 
2025-05-07T20:32:16.5479222Z         if scale_ub is not None:
2025-05-07T20:32:16.5479602Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:16.5480064Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:16.5480499Z             )
2025-05-07T20:32:16.5480763Z         else:
2025-05-07T20:32:16.5481059Z             scale_ub_tensor = None
2025-05-07T20:32:16.5481412Z 
2025-05-07T20:32:16.5481732Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:16.5482180Z             op = silu_mul_quant
2025-05-07T20:32:16.5482523Z             if compiled:
2025-05-07T20:32:16.5482867Z                 op = torch.compile(op)
2025-05-07T20:32:16.5483283Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5483668Z 
2025-05-07T20:32:16.5483933Z         y_fp8, y_scale = fn()
2025-05-07T20:32:16.5484328Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:16.5484732Z 
2025-05-07T20:32:16.5485067Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:16.5485544Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:16.5485955Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:16.5486427Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:16.5486975Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:16.5487539Z 
2025-05-07T20:32:16.5487837Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:16.5488135Z 
2025-05-07T20:32:16.5488286Z moe/activation_test.py:126: 
2025-05-07T20:32:16.5488730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:16.5489235Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:16.5489803Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:16.5491081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:16.5492392Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:16.5493252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5494323Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5495426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:16.5496575Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:16.5497758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:16.5498761Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:16.5499717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:16.5500525Z     fn()
2025-05-07T20:32:16.5501338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:16.5502251Z     self.fn.run(
2025-05-07T20:32:16.5502970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5503804Z     kernel = self.compile(
2025-05-07T20:32:16.5504658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5505714Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5506572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:16.5506955Z 
2025-05-07T20:32:16.5507281Z self = 
2025-05-07T20:32:16.5509178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.5511443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ceaf39c0>}
2025-05-07T20:32:16.5513595Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:16.5515222Z context = 
2025-05-07T20:32:16.5515682Z 
2025-05-07T20:32:16.5515937Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5516764Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5517493Z                            module_map=module_map)
2025-05-07T20:32:16.5518067Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5518607Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:16.5519010Z E       ^
2025-05-07T20:32:16.5519715Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5520454Z 
2025-05-07T20:32:16.5521123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
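[Note on the failure mode: the job header shows this ran on a linux.g5.4xlarge runner (NVIDIA A10G, compute capability 8.6), and Triton appears to expose the fp8e4nv (e4m3) dtype only on compute capability 8.9 and newer; on older parts it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A minimal sketch of a capability gate that a test like this could use to skip cleanly on unsupported GPUs; the helper name and the (8, 9) threshold are assumptions, not an actual FBGEMM guard:]

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton's fp8e4nv (e4m3) codegen targets SM 8.9+
        # (Ada/Hopper); the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...  # test_silu_mul_quant as listed above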
2025-05-07T20:32:16.5522103Z 
2025-05-07T20:32:16.5522274Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.5522907Z     self=,
2025-05-07T20:32:16.5523530Z     T=128,
2025-05-07T20:32:16.5523910Z     D=7168,
2025-05-07T20:32:16.5524191Z     scale_ub=None,
2025-05-07T20:32:16.5524503Z     contiguous=False,
2025-05-07T20:32:16.5524838Z     compiled=False,
2025-05-07T20:32:16.5525134Z )
2025-05-07T20:32:16.5525630Z self = 
2025-05-07T20:32:16.5526395Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:16.5526812Z 
[The test body repeats verbatim from the listing above and is omitted; this time the failure is raised from fn() rather than ref_fn():]
2025-05-07T20:32:16.5544043Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:16.5544296Z 
2025-05-07T20:32:16.5544437Z moe/activation_test.py:117: 
2025-05-07T20:32:16.5544870Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:16.5545347Z moe/activation_test.py:115: in fn
2025-05-07T20:32:16.5545738Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5546753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:16.5547969Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:16.5548764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5549915Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5550906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5551698Z     kernel = self.compile(
2025-05-07T20:32:16.5552494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5553539Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5554118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[self/options/codegen_fns/module_map locals as above, here with num_stages=3]
2025-05-07T20:32:16.5562656Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5563423Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5564113Z                            module_map=module_map)
2025-05-07T20:32:16.5564627Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5565135Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:16.5565512Z E       ^
2025-05-07T20:32:16.5566195Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5566874Z 
2025-05-07T20:32:16.5567496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
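[Both kernels fail at the same point: _fbgemm_silu_mul_quant (the eager fn() path) and _kernel_quantize_fp8_row (the ref_fn() path) are each rejected while lowering to fp8e4nv, so no parameter combination drawn by Hypothesis can pass on this GPU. A sketch of what a dtype fallback could look like under the same capability assumption; e5m2 trades mantissa bits for range, so this only illustrates the choice Triton's error points at, not fbgemm_gpu's actual behavior:]

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Assumption: e4m3 (Triton's fp8e4nv) needs SM 8.9+; fall back to
        # e5m2 (Triton's fp8e5), which is available on older architectures.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2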
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5566874Z 2025-05-07T20:32:16.5567496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5568273Z 2025-05-07T20:32:16.5568416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5569143Z self=, 2025-05-07T20:32:16.5569743Z T=4096, 2025-05-07T20:32:16.5570006Z D=5120, 2025-05-07T20:32:16.5570285Z scale_ub=1200.0, 2025-05-07T20:32:16.5570604Z contiguous=True, 2025-05-07T20:32:16.5570917Z compiled=False, 2025-05-07T20:32:16.5571211Z ) 2025-05-07T20:32:16.5571672Z self = 2025-05-07T20:32:16.5572513Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.5572923Z 2025-05-07T20:32:16.5573031Z @given( 2025-05-07T20:32:16.5573354Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5573806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5574236Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5574708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5575189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5575594Z ) 2025-05-07T20:32:16.5576974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5577681Z def test_silu_mul_quant( 2025-05-07T20:32:16.5578020Z self, 2025-05-07T20:32:16.5578292Z T: int, 2025-05-07T20:32:16.5578583Z D: int, 2025-05-07T20:32:16.5578880Z scale_ub: Optional[float], 2025-05-07T20:32:16.5579275Z contiguous: bool, 2025-05-07T20:32:16.5579703Z compiled: bool, 2025-05-07T20:32:16.5580029Z ) -> None: 2025-05-07T20:32:16.5580330Z torch.manual_seed(2025) 2025-05-07T20:32:16.5580686Z 2025-05-07T20:32:16.5581079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5581577Z 2025-05-07T20:32:16.5581861Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5582327Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5582762Z x = x_sign * x_clamp 2025-05-07T20:32:16.5583102Z x0 = x[:, :D] 2025-05-07T20:32:16.5583413Z x1 = x[:, D:] 2025-05-07T20:32:16.5583711Z 2025-05-07T20:32:16.5583975Z if contiguous: 2025-05-07T20:32:16.5584305Z x0 = x0.contiguous() 2025-05-07T20:32:16.5584662Z x1 = x1.contiguous() 2025-05-07T20:32:16.5585006Z 2025-05-07T20:32:16.5585282Z if scale_ub is not None: 2025-05-07T20:32:16.5585664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5586147Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5586597Z ) 2025-05-07T20:32:16.5586865Z else: 2025-05-07T20:32:16.5587150Z scale_ub_tensor = None 2025-05-07T20:32:16.5587511Z 2025-05-07T20:32:16.5587844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5588289Z op = silu_mul_quant 2025-05-07T20:32:16.5588643Z if compiled: 2025-05-07T20:32:16.5588991Z op = torch.compile(op) 2025-05-07T20:32:16.5589403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5589797Z 2025-05-07T20:32:16.5590072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5590302Z 2025-05-07T20:32:16.5590441Z moe/activation_test.py:117: 2025-05-07T20:32:16.5590860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5591341Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5591732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5592756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5593784Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5594567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5595578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5596564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5597454Z kernel = self.compile( 2025-05-07T20:32:16.5598245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5599210Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5599785Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5600123Z 2025-05-07T20:32:16.5600418Z self = 2025-05-07T20:32:16.5601591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5602358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96d49fb380>} 2025-05-07T20:32:16.5603468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5603743Z context = 2025-05-07T20:32:16.5603759Z 2025-05-07T20:32:16.5603993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5604446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5604602Z module_map=module_map) 2025-05-07T20:32:16.5604823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5605020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5605136Z E ^ 2025-05-07T20:32:16.5605665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5605678Z 2025-05-07T20:32:16.5606544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5606553Z 2025-05-07T20:32:16.5606701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5607019Z self=, 2025-05-07T20:32:16.5607141Z T=1, 2025-05-07T20:32:16.5607263Z D=5120, 2025-05-07T20:32:16.5607395Z scale_ub=None, 2025-05-07T20:32:16.5607548Z contiguous=True, 2025-05-07T20:32:16.5607663Z compiled=True, 2025-05-07T20:32:16.5607762Z ) 2025-05-07T20:32:16.5608089Z self = 2025-05-07T20:32:16.5608328Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5608339Z 2025-05-07T20:32:16.5608452Z @given( 2025-05-07T20:32:16.5608620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5608766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5608934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5609103Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5609264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5609377Z ) 2025-05-07T20:32:16.5609744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5609883Z def test_silu_mul_quant( 2025-05-07T20:32:16.5609989Z self, 2025-05-07T20:32:16.5610098Z T: int, 2025-05-07T20:32:16.5610214Z D: int, 2025-05-07T20:32:16.5610347Z scale_ub: Optional[float], 2025-05-07T20:32:16.5610470Z contiguous: bool, 2025-05-07T20:32:16.5610606Z compiled: bool, 2025-05-07T20:32:16.5610723Z ) -> None: 2025-05-07T20:32:16.5610854Z torch.manual_seed(2025) 2025-05-07T20:32:16.5610967Z 2025-05-07T20:32:16.5611207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5611559Z 2025-05-07T20:32:16.5611699Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5611978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5612104Z x = x_sign * x_clamp 2025-05-07T20:32:16.5612227Z x0 = x[:, :D] 2025-05-07T20:32:16.5612339Z x1 = x[:, D:] 2025-05-07T20:32:16.5612449Z 2025-05-07T20:32:16.5612569Z if contiguous: 2025-05-07T20:32:16.5612701Z x0 = x0.contiguous() 2025-05-07T20:32:16.5612834Z x1 = x1.contiguous() 2025-05-07T20:32:16.5612935Z 2025-05-07T20:32:16.5613061Z if scale_ub is not None: 2025-05-07T20:32:16.5613211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5613398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5613510Z ) 2025-05-07T20:32:16.5613629Z else: 2025-05-07T20:32:16.5613758Z scale_ub_tensor = None 2025-05-07T20:32:16.5613863Z 2025-05-07T20:32:16.5614062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5614185Z op = silu_mul_quant 2025-05-07T20:32:16.5614312Z if compiled: 2025-05-07T20:32:16.5614449Z op = torch.compile(op) 2025-05-07T20:32:16.5614596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5614708Z 2025-05-07T20:32:16.5614832Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5615090Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5615201Z 2025-05-07T20:32:16.5615394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5615533Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5615679Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5615938Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5616136Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5616245Z 2025-05-07T20:32:16.5616389Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5616396Z 2025-05-07T20:32:16.5616541Z moe/activation_test.py:126: 2025-05-07T20:32:16.5616724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5616875Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5617069Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5617898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5618045Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5618587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5618914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5619470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5619847Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5620403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5620645Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5621153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5621273Z fn() 2025-05-07T20:32:16.5621864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5621979Z self.fn.run( 2025-05-07T20:32:16.5622488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5622624Z kernel = self.compile( 2025-05-07T20:32:16.5623554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5623823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5624003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5624011Z 2025-05-07T20:32:16.5624311Z self = 2025-05-07T20:32:16.5625471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5626224Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ce7c0720>} 2025-05-07T20:32:16.5627373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5627653Z context = 2025-05-07T20:32:16.5627661Z 2025-05-07T20:32:16.5627903Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5628286Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5628537Z module_map=module_map) 2025-05-07T20:32:16.5628758Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5628904Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5629021Z E ^ 2025-05-07T20:32:16.5629536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5629602Z 2025-05-07T20:32:16.5630236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5630248Z 2025-05-07T20:32:16.5630407Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5630730Z self=, 2025-05-07T20:32:16.5647153Z T=2048, 2025-05-07T20:32:16.5647346Z D=5120, 2025-05-07T20:32:16.5647541Z scale_ub=None, 2025-05-07T20:32:16.5647693Z contiguous=True, 2025-05-07T20:32:16.5647895Z compiled=True, 2025-05-07T20:32:16.5648045Z ) 2025-05-07T20:32:16.5648485Z self = 2025-05-07T20:32:16.5648776Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5648784Z 2025-05-07T20:32:16.5648906Z @given( 2025-05-07T20:32:16.5649086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5649239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5649399Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5649582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5649757Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5649865Z ) 2025-05-07T20:32:16.5650226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5650374Z def test_silu_mul_quant( 2025-05-07T20:32:16.5650483Z self, 2025-05-07T20:32:16.5650593Z T: int, 2025-05-07T20:32:16.5650718Z D: int, 2025-05-07T20:32:16.5650871Z scale_ub: Optional[float], 2025-05-07T20:32:16.5651008Z contiguous: bool, 2025-05-07T20:32:16.5651139Z compiled: bool, 2025-05-07T20:32:16.5651259Z ) -> None: 2025-05-07T20:32:16.5651405Z torch.manual_seed(2025) 2025-05-07T20:32:16.5651511Z 2025-05-07T20:32:16.5651878Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5652000Z 2025-05-07T20:32:16.5652131Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5652573Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5652724Z x = x_sign * x_clamp 2025-05-07T20:32:16.5652881Z x0 = x[:, :D] 2025-05-07T20:32:16.5652997Z x1 = x[:, D:] 2025-05-07T20:32:16.5653120Z 2025-05-07T20:32:16.5653279Z if contiguous: 2025-05-07T20:32:16.5653413Z x0 = x0.contiguous() 2025-05-07T20:32:16.5653542Z x1 = x1.contiguous() 2025-05-07T20:32:16.5653662Z 2025-05-07T20:32:16.5653791Z if scale_ub is not None: 2025-05-07T20:32:16.5653936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5654129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5654239Z ) 2025-05-07T20:32:16.5654345Z else: 2025-05-07T20:32:16.5654487Z scale_ub_tensor = None 2025-05-07T20:32:16.5654597Z 2025-05-07T20:32:16.5654790Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5654914Z op = silu_mul_quant 2025-05-07T20:32:16.5655030Z if compiled: 2025-05-07T20:32:16.5655191Z op = torch.compile(op) 2025-05-07T20:32:16.5655336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5655439Z 2025-05-07T20:32:16.5655578Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5655754Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5655860Z 2025-05-07T20:32:16.5656064Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5656289Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5656432Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5656618Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5656816Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5656991Z 2025-05-07T20:32:16.5657138Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5657146Z 2025-05-07T20:32:16.5657292Z moe/activation_test.py:126: 2025-05-07T20:32:16.5657498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5657650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5657845Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5658700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5658849Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5659400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5659733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5660281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5660673Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5661235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5661489Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5662001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5662114Z fn() 2025-05-07T20:32:16.5662733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5662858Z self.fn.run( 2025-05-07T20:32:16.5663410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5663556Z kernel = self.compile( 2025-05-07T20:32:16.5664173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5664479Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5664867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5664877Z 2025-05-07T20:32:16.5665277Z self = 2025-05-07T20:32:16.5670893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5671707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96cf3f05e0>} 2025-05-07T20:32:16.5672815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5673098Z context = 2025-05-07T20:32:16.5673106Z 2025-05-07T20:32:16.5673343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5673727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5673874Z module_map=module_map) 2025-05-07T20:32:16.5674101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5674375Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5674481Z E ^ 2025-05-07T20:32:16.5675041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5675049Z 2025-05-07T20:32:16.5675656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5675722Z 2025-05-07T20:32:16.5675867Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5676189Z self=, 2025-05-07T20:32:16.5676295Z T=128, 2025-05-07T20:32:16.5676404Z D=5120, 2025-05-07T20:32:16.5676517Z scale_ub=None, 2025-05-07T20:32:16.5697461Z contiguous=True, 2025-05-07T20:32:16.5697621Z compiled=True, 2025-05-07T20:32:16.5697727Z ) 2025-05-07T20:32:16.5698065Z self = 2025-05-07T20:32:16.5698313Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5698320Z 2025-05-07T20:32:16.5698429Z @given( 2025-05-07T20:32:16.5698598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5698741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5698900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5699078Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5699237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5699343Z ) 2025-05-07T20:32:16.5699718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5699851Z def test_silu_mul_quant( 2025-05-07T20:32:16.5699973Z self, 2025-05-07T20:32:16.5700079Z T: int, 2025-05-07T20:32:16.5700186Z D: int, 2025-05-07T20:32:16.5700325Z scale_ub: Optional[float], 2025-05-07T20:32:16.5700452Z contiguous: bool, 2025-05-07T20:32:16.5700573Z compiled: bool, 2025-05-07T20:32:16.5700693Z ) -> None: 2025-05-07T20:32:16.5700829Z torch.manual_seed(2025) 2025-05-07T20:32:16.5700927Z 2025-05-07T20:32:16.5701164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5701276Z 2025-05-07T20:32:16.5701401Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5701582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5701712Z x = x_sign * x_clamp 2025-05-07T20:32:16.5701824Z x0 = x[:, :D] 2025-05-07T20:32:16.5702110Z x1 = x[:, D:] 2025-05-07T20:32:16.5702225Z 2025-05-07T20:32:16.5702345Z if contiguous: 2025-05-07T20:32:16.5702477Z x0 = x0.contiguous() 2025-05-07T20:32:16.5702612Z x1 = x1.contiguous() 2025-05-07T20:32:16.5702718Z 2025-05-07T20:32:16.5702855Z if scale_ub is not None: 2025-05-07T20:32:16.5703003Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5703198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5703313Z ) 2025-05-07T20:32:16.5703423Z else: 2025-05-07T20:32:16.5703558Z scale_ub_tensor = None 2025-05-07T20:32:16.5703671Z 2025-05-07T20:32:16.5703854Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5703983Z op = silu_mul_quant 2025-05-07T20:32:16.5704108Z if compiled: 2025-05-07T20:32:16.5704246Z op = torch.compile(op) 2025-05-07T20:32:16.5704395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5704511Z 2025-05-07T20:32:16.5704636Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5704820Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5704922Z 2025-05-07T20:32:16.5705117Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5705273Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5705415Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5705661Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5705872Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5705985Z 2025-05-07T20:32:16.5706421Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5706430Z 2025-05-07T20:32:16.5706804Z moe/activation_test.py:126: 2025-05-07T20:32:16.5707018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5707200Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5707404Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5708236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5708387Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5708917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5709243Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5709791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5710163Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5710728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5710968Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5711478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5711600Z fn() 2025-05-07T20:32:16.5712194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5712314Z self.fn.run( 2025-05-07T20:32:16.5712812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5712946Z kernel = self.compile( 2025-05-07T20:32:16.5713533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5713785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5713973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5713979Z 2025-05-07T20:32:16.5714487Z self = 2025-05-07T20:32:16.5715675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5716440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a92520>} 2025-05-07T20:32:16.5717596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5717877Z context = 2025-05-07T20:32:16.5717883Z 2025-05-07T20:32:16.5718116Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5718513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5718668Z module_map=module_map) 2025-05-07T20:32:16.5718893Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5719034Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5719151Z E ^ 2025-05-07T20:32:16.5719665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5719757Z 2025-05-07T20:32:16.5720377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5720384Z 2025-05-07T20:32:16.5720530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5720904Z self=, 2025-05-07T20:32:16.5721023Z T=4096, 2025-05-07T20:32:16.5721131Z D=5120, 2025-05-07T20:32:16.5721262Z scale_ub=None, 2025-05-07T20:32:16.5721385Z contiguous=True, 2025-05-07T20:32:16.5721501Z compiled=True, 2025-05-07T20:32:16.5721612Z ) 2025-05-07T20:32:16.5721929Z self = 2025-05-07T20:32:16.5722174Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5722181Z 2025-05-07T20:32:16.5722303Z @given( 2025-05-07T20:32:16.5722466Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5722598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5722762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5722929Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5723104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5723210Z ) 2025-05-07T20:32:16.5723566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5723704Z def test_silu_mul_quant( 2025-05-07T20:32:16.5723817Z self, 2025-05-07T20:32:16.5723932Z T: int, 2025-05-07T20:32:16.5724050Z D: int, 2025-05-07T20:32:16.5724189Z scale_ub: Optional[float], 2025-05-07T20:32:16.5724315Z contiguous: bool, 2025-05-07T20:32:16.5724451Z compiled: bool, 2025-05-07T20:32:16.5724564Z ) -> None: 2025-05-07T20:32:16.5724694Z torch.manual_seed(2025) 2025-05-07T20:32:16.5724807Z 2025-05-07T20:32:16.5725042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5725151Z 2025-05-07T20:32:16.5725292Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5725465Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5725596Z x = x_sign * x_clamp 2025-05-07T20:32:16.5725715Z x0 = x[:, :D] 2025-05-07T20:32:16.5725829Z x1 = x[:, D:] 2025-05-07T20:32:16.5725940Z 2025-05-07T20:32:16.5726063Z if contiguous: 2025-05-07T20:32:16.5726196Z x0 = x0.contiguous() 2025-05-07T20:32:16.5726444Z x1 = x1.contiguous() 2025-05-07T20:32:16.5726554Z 2025-05-07T20:32:16.5726681Z if scale_ub is not None: 2025-05-07T20:32:16.5726835Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5727028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5727156Z ) 2025-05-07T20:32:16.5727280Z else: 2025-05-07T20:32:16.5727438Z scale_ub_tensor = None 2025-05-07T20:32:16.5727553Z 2025-05-07T20:32:16.5727731Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5727853Z op = silu_mul_quant 2025-05-07T20:32:16.5727977Z if compiled: 2025-05-07T20:32:16.5728114Z op = torch.compile(op) 2025-05-07T20:32:16.5728265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5728377Z 2025-05-07T20:32:16.5728504Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5728679Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5728787Z 2025-05-07T20:32:16.5728975Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5729122Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5729270Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5729447Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5729653Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5729818Z 2025-05-07T20:32:16.5729959Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5729965Z 2025-05-07T20:32:16.5730108Z moe/activation_test.py:126: 2025-05-07T20:32:16.5730290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5730494Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5730694Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5731529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5731679Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5732292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5732616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5733174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5733540Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5734102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5734342Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5734856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5734972Z fn() 2025-05-07T20:32:16.5735563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5735680Z self.fn.run( 2025-05-07T20:32:16.5736180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5736310Z kernel = self.compile( 2025-05-07T20:32:16.5736875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5737126Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5737314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5737324Z 2025-05-07T20:32:16.5737655Z self = 2025-05-07T20:32:16.5738948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5739712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a55300>} 2025-05-07T20:32:16.5740823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5741110Z context = 2025-05-07T20:32:16.5741117Z 2025-05-07T20:32:16.5741367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5741765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5741931Z module_map=module_map) 2025-05-07T20:32:16.5742162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5742308Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5742423Z E ^ 2025-05-07T20:32:16.5742954Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5742961Z 2025-05-07T20:32:16.5743653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5743668Z 2025-05-07T20:32:16.5743818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5744138Z self=, 2025-05-07T20:32:16.5744323Z T=16384, 2025-05-07T20:32:16.5744430Z D=5120, 2025-05-07T20:32:16.5744543Z scale_ub=None, 2025-05-07T20:32:16.5744670Z contiguous=True, 2025-05-07T20:32:16.5744790Z compiled=True, 2025-05-07T20:32:16.5744892Z ) 2025-05-07T20:32:16.5745231Z self = 2025-05-07T20:32:16.5745478Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.5745484Z 2025-05-07T20:32:16.5745599Z @given( 2025-05-07T20:32:16.5745762Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5745907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5746083Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5746248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5746405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5746518Z ) 2025-05-07T20:32:16.5746875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5747012Z def test_silu_mul_quant( 2025-05-07T20:32:16.5747139Z self, 2025-05-07T20:32:16.5747267Z T: int, 2025-05-07T20:32:16.5747405Z D: int, 2025-05-07T20:32:16.5747548Z scale_ub: Optional[float], 2025-05-07T20:32:16.5747676Z contiguous: bool, 2025-05-07T20:32:16.5747801Z compiled: bool, 2025-05-07T20:32:16.5747912Z ) -> None: 2025-05-07T20:32:16.5748046Z torch.manual_seed(2025) 2025-05-07T20:32:16.5748154Z 2025-05-07T20:32:16.5748379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5748486Z 2025-05-07T20:32:16.5748622Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5748794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5748915Z x = x_sign * x_clamp 2025-05-07T20:32:16.5749032Z x0 = x[:, :D] 2025-05-07T20:32:16.5749150Z x1 = x[:, D:] 2025-05-07T20:32:16.5749256Z 2025-05-07T20:32:16.5749385Z if contiguous: 2025-05-07T20:32:16.5749508Z x0 = x0.contiguous() 2025-05-07T20:32:16.5749638Z x1 = x1.contiguous() 2025-05-07T20:32:16.5749742Z 2025-05-07T20:32:16.5749974Z if scale_ub is not None: 2025-05-07T20:32:16.5750134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5750327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5750436Z ) 2025-05-07T20:32:16.5750555Z else: 2025-05-07T20:32:16.5750685Z scale_ub_tensor = None 2025-05-07T20:32:16.5750785Z 2025-05-07T20:32:16.5750973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5751104Z op = silu_mul_quant 2025-05-07T20:32:16.5751225Z if compiled: 2025-05-07T20:32:16.5751368Z op = torch.compile(op) 2025-05-07T20:32:16.5751508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5751612Z 2025-05-07T20:32:16.5751733Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5751896Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5751997Z 2025-05-07T20:32:16.5752177Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5752326Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5752471Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5752639Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5752828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5752933Z 2025-05-07T20:32:16.5753069Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.5753145Z 2025-05-07T20:32:16.5753282Z moe/activation_test.py:126: 2025-05-07T20:32:16.5753454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5753593Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5753788Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5754714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5754858Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5755431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5755768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5756342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5756733Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5757324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5757583Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5758102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5758215Z fn() 2025-05-07T20:32:16.5758854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5758975Z self.fn.run( 2025-05-07T20:32:16.5759520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5759659Z kernel = self.compile( 2025-05-07T20:32:16.5760216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5760464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5760642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5760650Z 2025-05-07T20:32:16.5760948Z self = 2025-05-07T20:32:16.5762335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5763144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96391f8e00>} 2025-05-07T20:32:16.5764332Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5764624Z context = 2025-05-07T20:32:16.5764632Z 2025-05-07T20:32:16.5764875Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5765279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5765436Z module_map=module_map) 2025-05-07T20:32:16.5765668Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5765822Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5765936Z E ^ 2025-05-07T20:32:16.5766472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5766481Z 2025-05-07T20:32:16.5767144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5767230Z 2025-05-07T20:32:16.5767415Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5767758Z self=, 2025-05-07T20:32:16.5767880Z T=1, 2025-05-07T20:32:16.5767990Z D=5120, 2025-05-07T20:32:16.5768106Z scale_ub=1200.0, 2025-05-07T20:32:16.5768230Z contiguous=True, 2025-05-07T20:32:16.5768407Z compiled=True, 2025-05-07T20:32:16.5768505Z ) 2025-05-07T20:32:16.5768827Z self = 2025-05-07T20:32:16.5769073Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.5769080Z 2025-05-07T20:32:16.5769191Z @given( 2025-05-07T20:32:16.5769366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5769509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5769677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5769852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5770017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5770130Z ) 2025-05-07T20:32:16.5770507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5770651Z def test_silu_mul_quant( 2025-05-07T20:32:16.5770766Z self, 2025-05-07T20:32:16.5770875Z T: int, 2025-05-07T20:32:16.5770985Z D: int, 2025-05-07T20:32:16.5771131Z scale_ub: Optional[float], 2025-05-07T20:32:16.5771256Z contiguous: bool, 2025-05-07T20:32:16.5771374Z compiled: bool, 2025-05-07T20:32:16.5771508Z ) -> None: 2025-05-07T20:32:16.5771645Z torch.manual_seed(2025) 2025-05-07T20:32:16.5771749Z 2025-05-07T20:32:16.5772157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5772265Z 2025-05-07T20:32:16.5772404Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5772581Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5772713Z x = x_sign * x_clamp 2025-05-07T20:32:16.5772834Z x0 = x[:, :D] 2025-05-07T20:32:16.5772945Z x1 = x[:, D:] 2025-05-07T20:32:16.5773042Z 2025-05-07T20:32:16.5773168Z if contiguous: 2025-05-07T20:32:16.5773294Z x0 = x0.contiguous() 2025-05-07T20:32:16.5773414Z x1 = x1.contiguous() 2025-05-07T20:32:16.5773527Z 2025-05-07T20:32:16.5773652Z if scale_ub is not None: 2025-05-07T20:32:16.5773794Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5774097Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5774206Z ) 2025-05-07T20:32:16.5774318Z else: 2025-05-07T20:32:16.5774447Z scale_ub_tensor = None 2025-05-07T20:32:16.5774544Z 2025-05-07T20:32:16.5774733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5774856Z op = silu_mul_quant 2025-05-07T20:32:16.5774971Z if compiled: 2025-05-07T20:32:16.5775121Z op = torch.compile(op) 2025-05-07T20:32:16.5775268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5775370Z 2025-05-07T20:32:16.5775498Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5775504Z 2025-05-07T20:32:16.5775640Z moe/activation_test.py:117: 2025-05-07T20:32:16.5775831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5775980Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5776117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5776725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.5776862Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.5777689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.5777837Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.5778385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5778802Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5779400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5779701Z kernel = self.compile( 2025-05-07T20:32:16.5780311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5780571Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5780763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5780771Z 2025-05-07T20:32:16.5781067Z self = 2025-05-07T20:32:16.5782295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5783123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96394cccc0>} 2025-05-07T20:32:16.5784311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5784619Z context = 2025-05-07T20:32:16.5784627Z 2025-05-07T20:32:16.5784878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5785306Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5785465Z module_map=module_map) 2025-05-07T20:32:16.5785708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5785865Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.5785977Z E ^ 2025-05-07T20:32:16.5786529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5786552Z 2025-05-07T20:32:16.5787190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5787198Z 2025-05-07T20:32:16.5787351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5787793Z self=, 2025-05-07T20:32:16.5787910Z T=1, 2025-05-07T20:32:16.5788016Z D=5120, 2025-05-07T20:32:16.5788137Z scale_ub=None, 2025-05-07T20:32:16.5788259Z contiguous=False, 2025-05-07T20:32:16.5788379Z compiled=True, 2025-05-07T20:32:16.5788489Z ) 2025-05-07T20:32:16.5788813Z self = 2025-05-07T20:32:16.5789066Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.5789074Z 2025-05-07T20:32:16.5789180Z @given( 2025-05-07T20:32:16.5789346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5789494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5789664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5789831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5790012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5790120Z ) 2025-05-07T20:32:16.5790498Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5790641Z def test_silu_mul_quant( 2025-05-07T20:32:16.5790749Z self, 2025-05-07T20:32:16.5790865Z T: int, 2025-05-07T20:32:16.5790973Z D: int, 2025-05-07T20:32:16.5791108Z scale_ub: Optional[float], 2025-05-07T20:32:16.5791311Z contiguous: bool, 2025-05-07T20:32:16.5791434Z compiled: bool, 2025-05-07T20:32:16.5791544Z ) -> None: 2025-05-07T20:32:16.5791683Z torch.manual_seed(2025) 2025-05-07T20:32:16.5791788Z 2025-05-07T20:32:16.5792040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5792209Z 2025-05-07T20:32:16.5792343Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5792519Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5792654Z x = x_sign * x_clamp 2025-05-07T20:32:16.5792774Z x0 = x[:, :D] 2025-05-07T20:32:16.5792883Z x1 = x[:, D:] 2025-05-07T20:32:16.5792992Z 2025-05-07T20:32:16.5793110Z if contiguous: 2025-05-07T20:32:16.5793244Z x0 = x0.contiguous() 2025-05-07T20:32:16.5793368Z x1 = x1.contiguous() 2025-05-07T20:32:16.5793473Z 2025-05-07T20:32:16.5793607Z if scale_ub is not None: 2025-05-07T20:32:16.5793756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5793956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5794072Z ) 2025-05-07T20:32:16.5794179Z else: 2025-05-07T20:32:16.5794311Z scale_ub_tensor = None 2025-05-07T20:32:16.5794424Z 2025-05-07T20:32:16.5794607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5794739Z op = silu_mul_quant 2025-05-07T20:32:16.5794861Z if compiled: 2025-05-07T20:32:16.5795002Z op = torch.compile(op) 2025-05-07T20:32:16.5795165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5795272Z 2025-05-07T20:32:16.5795399Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.5795578Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.5795681Z 2025-05-07T20:32:16.5795876Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5796025Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.5796169Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.5796343Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.5796561Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5796664Z 2025-05-07T20:32:16.5796817Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.5796828Z 2025-05-07T20:32:16.5796967Z moe/activation_test.py:126: 2025-05-07T20:32:16.5797175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5797464Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.5797666Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.5798561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.5798719Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.5799278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.5799610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.5800188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.5800584Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.5801198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.5801457Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.5801999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.5802112Z fn() 2025-05-07T20:32:16.5802730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.5802936Z self.fn.run( 2025-05-07T20:32:16.5803465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.5803608Z kernel = self.compile( 2025-05-07T20:32:16.5804213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.5804573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.5804773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5804788Z 2025-05-07T20:32:16.5805090Z self = 2025-05-07T20:32:16.5806945Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.5807629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96394cf2e0>} 2025-05-07T20:32:16.5808418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.5808628Z context = 2025-05-07T20:32:16.5808634Z 2025-05-07T20:32:16.5808814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.5809089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.5809202Z module_map=module_map) 2025-05-07T20:32:16.5809368Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.5809479Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.5809558Z E ^ 2025-05-07T20:32:16.5809928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:16.5810368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis retries test_silu_mul_quant with fresh parameter draws from here on. The test listing and the traceback repeat verbatim for every draw, so only the drawn parameters and the failing call are kept below; notes on the root cause and possible mitigations are interleaved between the groups of examples.]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fn() fails at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant: same CompilationError
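[Note on the root cause: fp8e4nv is Triton's name for torch.float8_e4m3fn, and Triton only lowers that type on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose A10G reports sm_86, so any kernel that touches an fp8e4nv value fails inside src.make_ir exactly as shown above. A minimal sketch of a capability probe one could run before exercising these kernels; the helper name is illustrative, not part of FBGEMM:]

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) only compiles on compute capability
    # >= (8, 9), e.g. L4/L40S/H100; an A10G (sm_86) is limited to the
    # fp8e5 / fp8e4b15 variants named in the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)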
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fn() fails under torch.compile (torch/_dynamo/eval_frame.py:678), compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> fn() fails compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fn() fails compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fn() fails compiling _fbgemm_silu_mul_quant: same CompilationError
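[The failure is reproducible outside the test suite with any Triton kernel that converts to fp8e4nv. A sketch, assuming a CUDA device and the Triton version shown in the tracebacks; the kernel below is illustrative, not FBGEMM code:]

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_probe(x_ptr, y_ptr):
    # the conversion to tl.float8e4nv is what trips the architecture
    # check inside src.make_ir during JIT compilation
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))

x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
# on sm_86 this raises the same CompilationError recorded in this log;
# on sm_89+ it compiles and runs
_fp8_cast_probe[(1,)](x, y)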
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fn() fails under torch.compile, compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> fn() fails under torch.compile, compiling _fbgemm_silu_mul_quant: same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> fn() itself passes; the reference path ref_fn() fails at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row (fp8_gemm.py:2370): same CompilationError
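[Every draw fails the same way, so letting Hypothesis walk through its remaining examples only spends runner time. A common mitigation is to gate the test on hardware support; a sketch under the assumption that the suite is unittest-based (class name and message are placeholders):]

import unittest

import torch

_HAS_FP8E4NV = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@unittest.skipIf(not _HAS_FP8E4NV, "fp8e4nv kernels require sm_89 or newer")
class SiluMulQuantTests(unittest.TestCase):
    # test_silu_mul_quant would live here unchanged
    ...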
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5930196Z 2025-05-07T20:32:16.5930638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5930642Z 2025-05-07T20:32:16.5930746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5930985Z self=, 2025-05-07T20:32:16.5931065Z T=1, 2025-05-07T20:32:16.5931144Z D=5120, 2025-05-07T20:32:16.5931236Z scale_ub=1200.0, 2025-05-07T20:32:16.5931325Z contiguous=False, 2025-05-07T20:32:16.5931408Z compiled=True, 2025-05-07T20:32:16.5931489Z ) 2025-05-07T20:32:16.5931722Z self = 2025-05-07T20:32:16.5932001Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.5932006Z 2025-05-07T20:32:16.5932085Z @given( 2025-05-07T20:32:16.5932207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5932312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5932474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5932593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5932713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5932789Z ) 2025-05-07T20:32:16.5933051Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5933190Z def test_silu_mul_quant( 2025-05-07T20:32:16.5933266Z self, 2025-05-07T20:32:16.5933348Z T: int, 2025-05-07T20:32:16.5933424Z D: int, 2025-05-07T20:32:16.5933529Z scale_ub: Optional[float], 2025-05-07T20:32:16.5933624Z contiguous: bool, 2025-05-07T20:32:16.5933710Z compiled: bool, 2025-05-07T20:32:16.5933788Z ) -> None: 2025-05-07T20:32:16.5933889Z torch.manual_seed(2025) 2025-05-07T20:32:16.5933961Z 2025-05-07T20:32:16.5934136Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5934216Z 2025-05-07T20:32:16.5934313Z x_sign = torch.sign(x) 2025-05-07T20:32:16.5934438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.5934533Z x = x_sign * x_clamp 2025-05-07T20:32:16.5934612Z x0 = x[:, :D] 2025-05-07T20:32:16.5934698Z x1 = x[:, D:] 2025-05-07T20:32:16.5934771Z 2025-05-07T20:32:16.5934858Z if contiguous: 2025-05-07T20:32:16.5934957Z x0 = x0.contiguous() 2025-05-07T20:32:16.5935047Z x1 = x1.contiguous() 2025-05-07T20:32:16.5935120Z 2025-05-07T20:32:16.5935219Z if scale_ub is not None: 2025-05-07T20:32:16.5935328Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.5935467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.5935548Z ) 2025-05-07T20:32:16.5935623Z else: 2025-05-07T20:32:16.5935718Z scale_ub_tensor = None 2025-05-07T20:32:16.5935796Z 2025-05-07T20:32:16.5935929Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.5936028Z op = silu_mul_quant 2025-05-07T20:32:16.5936113Z if compiled: 2025-05-07T20:32:16.5936213Z op = torch.compile(op) 2025-05-07T20:32:16.5936328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5936400Z 2025-05-07T20:32:16.5936490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.5936497Z 2025-05-07T20:32:16.5936600Z moe/activation_test.py:117: 2025-05-07T20:32:16.5936733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.5936916Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.5937036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.5937462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.5937561Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.5938083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:16.5938183Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:16.5938565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5938797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5939160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5939263Z kernel = self.compile(
2025-05-07T20:32:16.5939670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5939856Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5939989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:16.5939994Z
2025-05-07T20:32:16.5940206Z self =
2025-05-07T20:32:16.5941113Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.5941647Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96385aff60>}
2025-05-07T20:32:16.5942494Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:16.5942693Z context =
2025-05-07T20:32:16.5942698Z
2025-05-07T20:32:16.5942876Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5943153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5943263Z module_map=module_map)
2025-05-07T20:32:16.5943435Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5943537Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:16.5943616Z E ^
2025-05-07T20:32:16.5944000Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5944005Z
2025-05-07T20:32:16.5944450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.5944454Z
2025-05-07T20:32:16.5944564Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:16.5944798Z self=,
2025-05-07T20:32:16.5944877Z T=1,
2025-05-07T20:32:16.5944963Z D=5120,
2025-05-07T20:32:16.5945047Z scale_ub=1200.0,
2025-05-07T20:32:16.5945135Z contiguous=False,
2025-05-07T20:32:16.5945227Z compiled=False,
2025-05-07T20:32:16.5945302Z )
2025-05-07T20:32:16.5945531Z self =
2025-05-07T20:32:16.5945716Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:16.5945721Z
2025-05-07T20:32:16.5945803Z @given(
2025-05-07T20:32:16.5945930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:16.5946031Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:16.5946228Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:16.5946355Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:16.5946471Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:16.5946546Z )
2025-05-07T20:32:16.5946811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:16.5946904Z def test_silu_mul_quant(
2025-05-07T20:32:16.5946987Z self,
2025-05-07T20:32:16.5947066Z T: int,
2025-05-07T20:32:16.5947143Z D: int,
2025-05-07T20:32:16.5947248Z scale_ub: Optional[float],
2025-05-07T20:32:16.5947337Z contiguous: bool,
2025-05-07T20:32:16.5947422Z compiled: bool,
2025-05-07T20:32:16.5947506Z ) -> None:
2025-05-07T20:32:16.5947601Z torch.manual_seed(2025)
2025-05-07T20:32:16.5947677Z
2025-05-07T20:32:16.5947853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.5947927Z
2025-05-07T20:32:16.5948022Z x_sign = torch.sign(x)
2025-05-07T20:32:16.5948158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.5948245Z x = x_sign * x_clamp
2025-05-07T20:32:16.5948326Z x0 = x[:, :D]
2025-05-07T20:32:16.5948411Z x1 = x[:, D:]
2025-05-07T20:32:16.5948483Z
2025-05-07T20:32:16.5948573Z if contiguous:
2025-05-07T20:32:16.5948664Z x0 = x0.contiguous()
2025-05-07T20:32:16.5948797Z x1 = x1.contiguous()
2025-05-07T20:32:16.5948874Z
2025-05-07T20:32:16.5948963Z if scale_ub is not None:
2025-05-07T20:32:16.5949068Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:16.5949211Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:16.5949289Z )
2025-05-07T20:32:16.5949407Z else:
2025-05-07T20:32:16.5949509Z scale_ub_tensor = None
2025-05-07T20:32:16.5949583Z
2025-05-07T20:32:16.5949715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:16.5949816Z op = silu_mul_quant
2025-05-07T20:32:16.5949900Z if compiled:
2025-05-07T20:32:16.5950006Z op = torch.compile(op)
2025-05-07T20:32:16.5950112Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5950184Z
2025-05-07T20:32:16.5950281Z > y_fp8, y_scale = fn()
2025-05-07T20:32:16.5950285Z
2025-05-07T20:32:16.5950384Z moe/activation_test.py:117:
2025-05-07T20:32:16.5950519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:16.5950627Z moe/activation_test.py:115: in fn
2025-05-07T20:32:16.5950728Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.5951259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:16.5951367Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:16.5951746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:16.5951988Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:16.5952349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:16.5952443Z kernel = self.compile(
2025-05-07T20:32:16.5952854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:16.5953036Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:16.5953174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:16.5953178Z
2025-05-07T20:32:16.5953390Z self =
2025-05-07T20:32:16.5954298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:16.5954836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f963966e980>}
2025-05-07T20:32:16.5955629Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:16.5955835Z context =
2025-05-07T20:32:16.5955839Z
2025-05-07T20:32:16.5956009Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:16.5956285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:16.5956401Z module_map=module_map)
2025-05-07T20:32:16.5956566Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:16.5956679Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:16.5956756Z E ^
2025-05-07T20:32:16.5957131Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5957580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.5957627Z
[Hypothesis then tried ten more examples, and every one failed at the same point with the identical Triton CompilationError. The repeated test-source listings and tracebacks, verbatim copies of the ones shown above, are omitted here; only the sampled parameters differed:
  T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=1, D=5120, scale_ub=None, contiguous=False, compiled=False
  T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True
  T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True
  T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=1, D=7168, scale_ub=None, contiguous=True, compiled=False
The final example tried in this chunk follows with its full traceback.]
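[Note on the failure mode: the exception is raised while Triton compiles the _fbgemm_silu_mul_quant kernel, before any numerics run, which is why every sampled parameter combination fails identically. Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer; on older architectures it raises exactly this ValueError and offers only fp8e4b15 and fp8e5, so this is an environment mismatch rather than a kernel bug. (The empty-looking reprs above, e.g. "self = ", appear to be angle-bracketed object reprs stripped by the log capture.) A minimal capability guard (a sketch only, assuming the test module imports torch; the helper and class names below are hypothetical, not part of activation_test.py) could look like:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    """Best-effort check for Triton fp8e4nv (torch.float8_e4m3fn) support."""
    # Guard the device query so CPU-only runners fall through cleanly.
    if not torch.cuda.is_available():
        return False
    # fp8e4nv lowers only on compute capability 8.9+ (Ada/Hopper-class GPUs);
    # older parts get exactly the ValueError reported above.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(
    not cuda_supports_fp8e4nv(),
    "Triton fp8e4nv requires a GPU with compute capability >= 8.9",
)
class SiluMulQuantFP8Test(unittest.TestCase):
    ...

With a guard like this, the examples above would be reported as skips rather than errors on pre-8.9 GPUs; the same check could instead gate only the FP8 branch inside the test body.]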
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6098438Z 2025-05-07T20:32:16.6098879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6098922Z 2025-05-07T20:32:16.6099026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6099261Z self=, 2025-05-07T20:32:16.6099336Z T=16384, 2025-05-07T20:32:16.6099416Z D=7168, 2025-05-07T20:32:16.6099500Z scale_ub=1200.0, 2025-05-07T20:32:16.6099585Z contiguous=False, 2025-05-07T20:32:16.6099667Z compiled=True, 2025-05-07T20:32:16.6099743Z ) 2025-05-07T20:32:16.6099973Z self = 2025-05-07T20:32:16.6100157Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6100169Z 2025-05-07T20:32:16.6100245Z @given( 2025-05-07T20:32:16.6100364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6100471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6100589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6100706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6100826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6100898Z ) 2025-05-07T20:32:16.6101159Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6101255Z def test_silu_mul_quant( 2025-05-07T20:32:16.6101328Z self, 2025-05-07T20:32:16.6101406Z T: int, 2025-05-07T20:32:16.6101485Z D: int, 2025-05-07T20:32:16.6101584Z scale_ub: Optional[float], 2025-05-07T20:32:16.6101678Z contiguous: bool, 2025-05-07T20:32:16.6101764Z compiled: bool, 2025-05-07T20:32:16.6101843Z ) -> None: 2025-05-07T20:32:16.6101944Z torch.manual_seed(2025) 2025-05-07T20:32:16.6102016Z 2025-05-07T20:32:16.6102185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6102263Z 2025-05-07T20:32:16.6102353Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6102478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6107570Z x = x_sign * x_clamp 2025-05-07T20:32:16.6107675Z x0 = x[:, :D] 2025-05-07T20:32:16.6107757Z x1 = x[:, D:] 2025-05-07T20:32:16.6107827Z 2025-05-07T20:32:16.6108106Z if contiguous: 2025-05-07T20:32:16.6108205Z x0 = x0.contiguous() 2025-05-07T20:32:16.6108291Z x1 = x1.contiguous() 2025-05-07T20:32:16.6108362Z 2025-05-07T20:32:16.6108451Z if scale_ub is not None: 2025-05-07T20:32:16.6108555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6108695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6108778Z ) 2025-05-07T20:32:16.6108853Z else: 2025-05-07T20:32:16.6108950Z scale_ub_tensor = None 2025-05-07T20:32:16.6109022Z 2025-05-07T20:32:16.6109155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6109248Z op = silu_mul_quant 2025-05-07T20:32:16.6109331Z if compiled: 2025-05-07T20:32:16.6109433Z op = torch.compile(op) 2025-05-07T20:32:16.6109539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6109609Z 2025-05-07T20:32:16.6109696Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6109707Z 2025-05-07T20:32:16.6109806Z moe/activation_test.py:117: 2025-05-07T20:32:16.6109937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6110040Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6110138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6110530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6110692Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6111210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6111307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6111739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6111967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6112327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6112421Z kernel = self.compile( 2025-05-07T20:32:16.6112815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6112997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6113137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6113141Z 2025-05-07T20:32:16.6113353Z self = 2025-05-07T20:32:16.6114159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6114687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f963914fb00>} 2025-05-07T20:32:16.6115461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6115659Z context = 2025-05-07T20:32:16.6115665Z 2025-05-07T20:32:16.6115836Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6116112Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6116218Z module_map=module_map) 2025-05-07T20:32:16.6116385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6116485Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6116563Z E ^ 2025-05-07T20:32:16.6117012Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6117018Z 2025-05-07T20:32:16.6117452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6117456Z 2025-05-07T20:32:16.6117559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6117821Z self=, 2025-05-07T20:32:16.6117912Z T=1, 2025-05-07T20:32:16.6117995Z D=7168, 2025-05-07T20:32:16.6118082Z scale_ub=None, 2025-05-07T20:32:16.6118166Z contiguous=False, 2025-05-07T20:32:16.6118248Z compiled=False, 2025-05-07T20:32:16.6118329Z ) 2025-05-07T20:32:16.6118552Z self = 2025-05-07T20:32:16.6118726Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.6118731Z 2025-05-07T20:32:16.6118810Z @given( 2025-05-07T20:32:16.6118930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6119032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6119146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6119260Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6119377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6119493Z ) 2025-05-07T20:32:16.6119745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6119842Z def test_silu_mul_quant( 2025-05-07T20:32:16.6119915Z self, 2025-05-07T20:32:16.6119993Z T: int, 2025-05-07T20:32:16.6120069Z D: int, 2025-05-07T20:32:16.6120210Z scale_ub: Optional[float], 2025-05-07T20:32:16.6120301Z contiguous: bool, 2025-05-07T20:32:16.6120384Z compiled: bool, 2025-05-07T20:32:16.6120461Z ) -> None: 2025-05-07T20:32:16.6120557Z torch.manual_seed(2025) 2025-05-07T20:32:16.6120632Z 2025-05-07T20:32:16.6120803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6120877Z 2025-05-07T20:32:16.6120966Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6121089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6121180Z x = x_sign * x_clamp 2025-05-07T20:32:16.6121259Z x0 = x[:, :D] 2025-05-07T20:32:16.6121340Z x1 = x[:, D:] 2025-05-07T20:32:16.6121414Z 2025-05-07T20:32:16.6121496Z if contiguous: 2025-05-07T20:32:16.6121590Z x0 = x0.contiguous() 2025-05-07T20:32:16.6121675Z x1 = x1.contiguous() 2025-05-07T20:32:16.6121746Z 2025-05-07T20:32:16.6121838Z if scale_ub is not None: 2025-05-07T20:32:16.6121944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6122078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6122154Z ) 2025-05-07T20:32:16.6122233Z else: 2025-05-07T20:32:16.6122327Z scale_ub_tensor = None 2025-05-07T20:32:16.6122403Z 2025-05-07T20:32:16.6122532Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6122619Z op = silu_mul_quant 2025-05-07T20:32:16.6122706Z if compiled: 2025-05-07T20:32:16.6122805Z op = torch.compile(op) 2025-05-07T20:32:16.6122913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6122984Z 2025-05-07T20:32:16.6123074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6123078Z 2025-05-07T20:32:16.6123174Z moe/activation_test.py:117: 2025-05-07T20:32:16.6123304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6123403Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6123505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6124099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6124199Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6124573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6124800Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6125155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6125250Z kernel = self.compile( 2025-05-07T20:32:16.6125645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6125828Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6125961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6125966Z 2025-05-07T20:32:16.6126179Z self = 2025-05-07T20:32:16.6126992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6127513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96ce1200e0>} 2025-05-07T20:32:16.6128340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6128536Z context = 2025-05-07T20:32:16.6128578Z 2025-05-07T20:32:16.6128752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6129032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6129139Z module_map=module_map) 2025-05-07T20:32:16.6129308Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6129407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6129486Z E ^ 2025-05-07T20:32:16.6129859Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6129866Z 2025-05-07T20:32:16.6130302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6130307Z 2025-05-07T20:32:16.6130412Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6130646Z self=, 2025-05-07T20:32:16.6130723Z T=2048, 2025-05-07T20:32:16.6130797Z D=7168, 2025-05-07T20:32:16.6130878Z scale_ub=None, 2025-05-07T20:32:16.6130966Z contiguous=False, 2025-05-07T20:32:16.6131052Z compiled=True, 2025-05-07T20:32:16.6131123Z ) 2025-05-07T20:32:16.6131353Z self = 2025-05-07T20:32:16.6131531Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6131535Z 2025-05-07T20:32:16.6131609Z @given( 2025-05-07T20:32:16.6131731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6131916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6132035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6132151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6132264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6132344Z ) 2025-05-07T20:32:16.6132595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6132687Z def test_silu_mul_quant( 2025-05-07T20:32:16.6132765Z self, 2025-05-07T20:32:16.6132925Z T: int, 2025-05-07T20:32:16.6133004Z D: int, 2025-05-07T20:32:16.6133105Z scale_ub: Optional[float], 2025-05-07T20:32:16.6133193Z contiguous: bool, 2025-05-07T20:32:16.6133276Z compiled: bool, 2025-05-07T20:32:16.6133357Z ) -> None: 2025-05-07T20:32:16.6133449Z torch.manual_seed(2025) 2025-05-07T20:32:16.6133524Z 2025-05-07T20:32:16.6133693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6133768Z 2025-05-07T20:32:16.6133861Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6133984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6134070Z x = x_sign * x_clamp 2025-05-07T20:32:16.6134151Z x0 = x[:, :D] 2025-05-07T20:32:16.6134237Z x1 = x[:, D:] 2025-05-07T20:32:16.6134307Z 2025-05-07T20:32:16.6134391Z if contiguous: 2025-05-07T20:32:16.6134480Z x0 = x0.contiguous() 2025-05-07T20:32:16.6134566Z x1 = x1.contiguous() 2025-05-07T20:32:16.6134646Z 2025-05-07T20:32:16.6134735Z if scale_ub is not None: 2025-05-07T20:32:16.6134836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6134976Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6135050Z ) 2025-05-07T20:32:16.6135126Z else: 2025-05-07T20:32:16.6135218Z scale_ub_tensor = None 2025-05-07T20:32:16.6135333Z 2025-05-07T20:32:16.6135463Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6135550Z op = silu_mul_quant 2025-05-07T20:32:16.6135633Z if compiled: 2025-05-07T20:32:16.6135734Z op = torch.compile(op) 2025-05-07T20:32:16.6135839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6135950Z 2025-05-07T20:32:16.6136041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6136045Z 2025-05-07T20:32:16.6136138Z moe/activation_test.py:117: 2025-05-07T20:32:16.6136277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6136375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6136475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6136861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6136953Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6137525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6137627Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6138002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6138239Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6138593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6138690Z kernel = self.compile( 2025-05-07T20:32:16.6139094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6139275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6139405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6139417Z 2025-05-07T20:32:16.6139626Z self = 2025-05-07T20:32:16.6140445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6140984Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a56ac0>} 2025-05-07T20:32:16.6141875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6142073Z context = 2025-05-07T20:32:16.6142078Z 2025-05-07T20:32:16.6142249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6142525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6142634Z module_map=module_map) 2025-05-07T20:32:16.6142797Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6142899Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6142979Z E ^ 2025-05-07T20:32:16.6143352Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6143363Z 2025-05-07T20:32:16.6143803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6143807Z 2025-05-07T20:32:16.6143910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6144142Z self=, 2025-05-07T20:32:16.6144220Z T=4096, 2025-05-07T20:32:16.6144338Z D=7168, 2025-05-07T20:32:16.6144422Z scale_ub=None, 2025-05-07T20:32:16.6144506Z contiguous=False, 2025-05-07T20:32:16.6144589Z compiled=True, 2025-05-07T20:32:16.6144661Z ) 2025-05-07T20:32:16.6144884Z self = 2025-05-07T20:32:16.6145062Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6145107Z 2025-05-07T20:32:16.6145183Z @given( 2025-05-07T20:32:16.6145301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6145403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6145518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6145635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6145751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6145824Z ) 2025-05-07T20:32:16.6146076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6146172Z def test_silu_mul_quant( 2025-05-07T20:32:16.6146247Z self, 2025-05-07T20:32:16.6146322Z T: int, 2025-05-07T20:32:16.6146397Z D: int, 2025-05-07T20:32:16.6146492Z scale_ub: Optional[float], 2025-05-07T20:32:16.6146578Z contiguous: bool, 2025-05-07T20:32:16.6146664Z compiled: bool, 2025-05-07T20:32:16.6146742Z ) -> None: 2025-05-07T20:32:16.6146835Z torch.manual_seed(2025) 2025-05-07T20:32:16.6146906Z 2025-05-07T20:32:16.6147075Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6147154Z 2025-05-07T20:32:16.6147245Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6147368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6147457Z x = x_sign * x_clamp 2025-05-07T20:32:16.6147534Z x0 = x[:, :D] 2025-05-07T20:32:16.6147613Z x1 = x[:, D:] 2025-05-07T20:32:16.6147703Z 2025-05-07T20:32:16.6147790Z if contiguous: 2025-05-07T20:32:16.6147904Z x0 = x0.contiguous() 2025-05-07T20:32:16.6147996Z x1 = x1.contiguous() 2025-05-07T20:32:16.6148068Z 2025-05-07T20:32:16.6148156Z if scale_ub is not None: 2025-05-07T20:32:16.6148261Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6148397Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6148473Z ) 2025-05-07T20:32:16.6148549Z else: 2025-05-07T20:32:16.6148641Z scale_ub_tensor = None 2025-05-07T20:32:16.6148711Z 2025-05-07T20:32:16.6148918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6149008Z op = silu_mul_quant 2025-05-07T20:32:16.6149095Z if compiled: 2025-05-07T20:32:16.6149191Z op = torch.compile(op) 2025-05-07T20:32:16.6149296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6149369Z 2025-05-07T20:32:16.6149456Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6149463Z 2025-05-07T20:32:16.6149558Z moe/activation_test.py:117: 2025-05-07T20:32:16.6149691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6149788Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6149889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6150275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6150370Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6150898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6150996Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6151370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6151605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6152005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6152103Z kernel = self.compile( 2025-05-07T20:32:16.6152504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6152723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6152855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6152860Z 2025-05-07T20:32:16.6153077Z self = 2025-05-07T20:32:16.6153897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6154423Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a55b20>} 2025-05-07T20:32:16.6155215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6155417Z context = 2025-05-07T20:32:16.6155421Z 2025-05-07T20:32:16.6155591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6155875Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6155980Z module_map=module_map) 2025-05-07T20:32:16.6156144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6156248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6156324Z E ^ 2025-05-07T20:32:16.6156699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6156708Z 2025-05-07T20:32:16.6157146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6157150Z 2025-05-07T20:32:16.6157253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6157486Z self=, 2025-05-07T20:32:16.6157562Z T=16384, 2025-05-07T20:32:16.6157635Z D=5120, 2025-05-07T20:32:16.6157794Z scale_ub=1200.0, 2025-05-07T20:32:16.6157879Z contiguous=False, 2025-05-07T20:32:16.6157969Z compiled=False, 2025-05-07T20:32:16.6158038Z ) 2025-05-07T20:32:16.6158262Z self = 2025-05-07T20:32:16.6158449Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.6158453Z 2025-05-07T20:32:16.6158529Z @given( 2025-05-07T20:32:16.6158647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6158749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6158862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6158976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6159094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6159165Z ) 2025-05-07T20:32:16.6159419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6159517Z def test_silu_mul_quant( 2025-05-07T20:32:16.6159590Z self, 2025-05-07T20:32:16.6159666Z T: int, 2025-05-07T20:32:16.6159740Z D: int, 2025-05-07T20:32:16.6159834Z scale_ub: Optional[float], 2025-05-07T20:32:16.6159926Z contiguous: bool, 2025-05-07T20:32:16.6160014Z compiled: bool, 2025-05-07T20:32:16.6160091Z ) -> None: 2025-05-07T20:32:16.6160188Z torch.manual_seed(2025) 2025-05-07T20:32:16.6160302Z 2025-05-07T20:32:16.6160470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6160545Z 2025-05-07T20:32:16.6160634Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6160758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6160843Z x = x_sign * x_clamp 2025-05-07T20:32:16.6160963Z x0 = x[:, :D] 2025-05-07T20:32:16.6161043Z x1 = x[:, D:] 2025-05-07T20:32:16.6161113Z 2025-05-07T20:32:16.6161194Z if contiguous: 2025-05-07T20:32:16.6161291Z x0 = x0.contiguous() 2025-05-07T20:32:16.6161378Z x1 = x1.contiguous() 2025-05-07T20:32:16.6161448Z 2025-05-07T20:32:16.6161538Z if scale_ub is not None: 2025-05-07T20:32:16.6161641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6161775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6161851Z ) 2025-05-07T20:32:16.6161930Z else: 2025-05-07T20:32:16.6162021Z scale_ub_tensor = None 2025-05-07T20:32:16.6162094Z 2025-05-07T20:32:16.6162220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6162311Z op = silu_mul_quant 2025-05-07T20:32:16.6162395Z if compiled: 2025-05-07T20:32:16.6162490Z op = torch.compile(op) 2025-05-07T20:32:16.6162599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6162669Z 2025-05-07T20:32:16.6162755Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6162760Z 2025-05-07T20:32:16.6162862Z moe/activation_test.py:117: 2025-05-07T20:32:16.6162990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6163087Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6163186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6163711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.6163816Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6164190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6164420Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6164784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6164876Z kernel = self.compile( 2025-05-07T20:32:16.6165366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6165550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6165680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6165685Z 2025-05-07T20:32:16.6165897Z self = 2025-05-07T20:32:16.6166714Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6167248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9639a54c20>} 2025-05-07T20:32:16.6168093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6168291Z context = 2025-05-07T20:32:16.6168295Z 2025-05-07T20:32:16.6168464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6168738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6168889Z module_map=module_map) 2025-05-07T20:32:16.6169055Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6169158Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6169234Z E ^ 2025-05-07T20:32:16.6169605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6169675Z 2025-05-07T20:32:16.6170122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6170126Z 2025-05-07T20:32:16.6170228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6170464Z self=, 2025-05-07T20:32:16.6170539Z T=16384, 2025-05-07T20:32:16.6170613Z D=5120, 2025-05-07T20:32:16.6170695Z scale_ub=1200.0, 2025-05-07T20:32:16.6170778Z contiguous=True, 2025-05-07T20:32:16.6170862Z compiled=True, 2025-05-07T20:32:16.6170934Z ) 2025-05-07T20:32:16.6171157Z self = 2025-05-07T20:32:16.6171336Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6171340Z 2025-05-07T20:32:16.6171421Z @given( 2025-05-07T20:32:16.6171539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6171637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6171809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6171945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6172060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6172133Z ) 2025-05-07T20:32:16.6172385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6172480Z def test_silu_mul_quant( 2025-05-07T20:32:16.6172553Z self, 2025-05-07T20:32:16.6172632Z T: int, 2025-05-07T20:32:16.6172710Z D: int, 2025-05-07T20:32:16.6172806Z scale_ub: Optional[float], 2025-05-07T20:32:16.6172891Z contiguous: bool, 2025-05-07T20:32:16.6172976Z compiled: bool, 2025-05-07T20:32:16.6173052Z ) -> None: 2025-05-07T20:32:16.6173146Z torch.manual_seed(2025) 2025-05-07T20:32:16.6173219Z 2025-05-07T20:32:16.6173387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6173461Z 2025-05-07T20:32:16.6173549Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6173756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6173853Z x = x_sign * x_clamp 2025-05-07T20:32:16.6173935Z x0 = x[:, :D] 2025-05-07T20:32:16.6174011Z x1 = x[:, D:] 2025-05-07T20:32:16.6174085Z 2025-05-07T20:32:16.6174167Z if contiguous: 2025-05-07T20:32:16.6174255Z x0 = x0.contiguous() 2025-05-07T20:32:16.6174346Z x1 = x1.contiguous() 2025-05-07T20:32:16.6174418Z 2025-05-07T20:32:16.6174504Z if scale_ub is not None: 2025-05-07T20:32:16.6174610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6174744Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6174819Z ) 2025-05-07T20:32:16.6174893Z else: 2025-05-07T20:32:16.6174986Z scale_ub_tensor = None 2025-05-07T20:32:16.6175056Z 2025-05-07T20:32:16.6175185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6175272Z op = silu_mul_quant 2025-05-07T20:32:16.6175361Z if compiled: 2025-05-07T20:32:16.6175457Z op = torch.compile(op) 2025-05-07T20:32:16.6175559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6175634Z 2025-05-07T20:32:16.6175721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6175725Z 2025-05-07T20:32:16.6175822Z moe/activation_test.py:117: 2025-05-07T20:32:16.6175950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6176093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6176193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6176578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6176709Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6177230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6177330Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6177707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6177935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6178289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6178386Z kernel = self.compile( 2025-05-07T20:32:16.6178783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6178961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6179092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6179100Z 2025-05-07T20:32:16.6179308Z self = 2025-05-07T20:32:16.6180128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6180650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96385af060>} 2025-05-07T20:32:16.6181445Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6181638Z context = 2025-05-07T20:32:16.6181645Z 2025-05-07T20:32:16.6181812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6182087Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6182265Z module_map=module_map) 2025-05-07T20:32:16.6182429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6182530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6182604Z E ^ 2025-05-07T20:32:16.6182976Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6182983Z 2025-05-07T20:32:16.6183416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6183420Z 2025-05-07T20:32:16.6183523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6183757Z self=, 2025-05-07T20:32:16.6183834Z T=16384, 2025-05-07T20:32:16.6183911Z D=5120, 2025-05-07T20:32:16.6183991Z scale_ub=None, 2025-05-07T20:32:16.6184075Z contiguous=False, 2025-05-07T20:32:16.6184162Z compiled=True, 2025-05-07T20:32:16.6184237Z ) 2025-05-07T20:32:16.6184461Z self = 2025-05-07T20:32:16.6184644Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6184649Z 2025-05-07T20:32:16.6184722Z @given( 2025-05-07T20:32:16.6184839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6184982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6185095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6185213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6185323Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6185394Z ) 2025-05-07T20:32:16.6185650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6185782Z def test_silu_mul_quant( 2025-05-07T20:32:16.6185856Z self, 2025-05-07T20:32:16.6185935Z T: int, 2025-05-07T20:32:16.6186013Z D: int, 2025-05-07T20:32:16.6186109Z scale_ub: Optional[float], 2025-05-07T20:32:16.6186199Z contiguous: bool, 2025-05-07T20:32:16.6186282Z compiled: bool, 2025-05-07T20:32:16.6186358Z ) -> None: 2025-05-07T20:32:16.6186453Z torch.manual_seed(2025) 2025-05-07T20:32:16.6186525Z 2025-05-07T20:32:16.6186693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6186769Z 2025-05-07T20:32:16.6186857Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6186983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6187067Z x = x_sign * x_clamp 2025-05-07T20:32:16.6187144Z x0 = x[:, :D] 2025-05-07T20:32:16.6187226Z x1 = x[:, D:] 2025-05-07T20:32:16.6187297Z 2025-05-07T20:32:16.6187378Z if contiguous: 2025-05-07T20:32:16.6187472Z x0 = x0.contiguous() 2025-05-07T20:32:16.6187559Z x1 = x1.contiguous() 2025-05-07T20:32:16.6187629Z 2025-05-07T20:32:16.6187725Z if scale_ub is not None: 2025-05-07T20:32:16.6187826Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6187960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6188039Z ) 2025-05-07T20:32:16.6188114Z else: 2025-05-07T20:32:16.6188207Z scale_ub_tensor = None 2025-05-07T20:32:16.6188275Z 2025-05-07T20:32:16.6188404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6188495Z op = silu_mul_quant 2025-05-07T20:32:16.6188577Z if compiled: 2025-05-07T20:32:16.6188672Z op = torch.compile(op) 2025-05-07T20:32:16.6188779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6188850Z 2025-05-07T20:32:16.6188939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6188944Z 2025-05-07T20:32:16.6189041Z moe/activation_test.py:117: 2025-05-07T20:32:16.6189169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6189355Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6189454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6189839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6189935Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6190451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6190549Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6190925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6191152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6191509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6191600Z kernel = self.compile( 2025-05-07T20:32:16.6192003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6192185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6192313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6192318Z 2025-05-07T20:32:16.6192529Z self = 2025-05-07T20:32:16.6193385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6193945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9638200b80>} 2025-05-07T20:32:16.6194736Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6194928Z context = 2025-05-07T20:32:16.6194933Z 2025-05-07T20:32:16.6195102Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6195375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6195480Z module_map=module_map) 2025-05-07T20:32:16.6195644Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6195740Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6195817Z E ^ 2025-05-07T20:32:16.6196185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6196190Z 2025-05-07T20:32:16.6196620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6196624Z 2025-05-07T20:32:16.6196726Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6196953Z self=, 2025-05-07T20:32:16.6197031Z T=2048, 2025-05-07T20:32:16.6197109Z D=5120, 2025-05-07T20:32:16.6197192Z scale_ub=None, 2025-05-07T20:32:16.6197278Z contiguous=False, 2025-05-07T20:32:16.6197362Z compiled=True, 2025-05-07T20:32:16.6197432Z ) 2025-05-07T20:32:16.6197657Z self = 2025-05-07T20:32:16.6197833Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.6197840Z 2025-05-07T20:32:16.6197913Z @given( 2025-05-07T20:32:16.6198033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6198129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6199067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6199207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6199319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6199395Z ) 2025-05-07T20:32:16.6199648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6199740Z def test_silu_mul_quant( 2025-05-07T20:32:16.6199822Z self, 2025-05-07T20:32:16.6199898Z T: int, 2025-05-07T20:32:16.6199971Z D: int, 2025-05-07T20:32:16.6200071Z scale_ub: Optional[float], 2025-05-07T20:32:16.6200157Z contiguous: bool, 2025-05-07T20:32:16.6200241Z compiled: bool, 2025-05-07T20:32:16.6200318Z ) -> None: 2025-05-07T20:32:16.6200413Z torch.manual_seed(2025) 2025-05-07T20:32:16.6200483Z 2025-05-07T20:32:16.6200656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6200727Z 2025-05-07T20:32:16.6200825Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6200960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6201046Z x = x_sign * x_clamp 2025-05-07T20:32:16.6201127Z x0 = x[:, :D] 2025-05-07T20:32:16.6201203Z x1 = x[:, D:] 2025-05-07T20:32:16.6201271Z 2025-05-07T20:32:16.6201355Z if contiguous: 2025-05-07T20:32:16.6201445Z x0 = x0.contiguous() 2025-05-07T20:32:16.6201604Z x1 = x1.contiguous() 2025-05-07T20:32:16.6201676Z 2025-05-07T20:32:16.6201762Z if scale_ub is not None: 2025-05-07T20:32:16.6201865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6202007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6202122Z ) 2025-05-07T20:32:16.6202196Z else: 2025-05-07T20:32:16.6202291Z scale_ub_tensor = None 2025-05-07T20:32:16.6202360Z 2025-05-07T20:32:16.6202493Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6202588Z op = silu_mul_quant 2025-05-07T20:32:16.6202671Z if compiled: 2025-05-07T20:32:16.6202774Z op = torch.compile(op) 2025-05-07T20:32:16.6202880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6202948Z 2025-05-07T20:32:16.6203040Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6203044Z 2025-05-07T20:32:16.6203140Z moe/activation_test.py:117: 2025-05-07T20:32:16.6203270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6203373Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6203469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6203851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6203947Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6204462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6204560Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6204927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6205151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6205508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6205602Z kernel = self.compile( 2025-05-07T20:32:16.6206002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6206465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6206606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6206611Z 2025-05-07T20:32:16.6206820Z self = 2025-05-07T20:32:16.6207773Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6208296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382020c0>} 2025-05-07T20:32:16.6209077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6209267Z context = 2025-05-07T20:32:16.6209278Z 2025-05-07T20:32:16.6209445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6209725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6209836Z module_map=module_map) 2025-05-07T20:32:16.6209998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6210096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6210172Z E ^ 2025-05-07T20:32:16.6210537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6210602Z 2025-05-07T20:32:16.6211033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6211038Z 2025-05-07T20:32:16.6211138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6211425Z self=, 2025-05-07T20:32:16.6211505Z T=2048, 2025-05-07T20:32:16.6211578Z D=5120, 2025-05-07T20:32:16.6211661Z scale_ub=1200.0, 2025-05-07T20:32:16.6211814Z contiguous=False, 2025-05-07T20:32:16.6211897Z compiled=True, 2025-05-07T20:32:16.6211970Z ) 2025-05-07T20:32:16.6212198Z self = 2025-05-07T20:32:16.6212374Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6212379Z 2025-05-07T20:32:16.6212456Z @given( 2025-05-07T20:32:16.6212572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6212671Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6212788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6212902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6213013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6213093Z ) 2025-05-07T20:32:16.6213344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6213438Z def test_silu_mul_quant( 2025-05-07T20:32:16.6213512Z self, 2025-05-07T20:32:16.6213591Z T: int, 2025-05-07T20:32:16.6213668Z D: int, 2025-05-07T20:32:16.6213764Z scale_ub: Optional[float], 2025-05-07T20:32:16.6213850Z contiguous: bool, 2025-05-07T20:32:16.6213939Z compiled: bool, 2025-05-07T20:32:16.6214014Z ) -> None: 2025-05-07T20:32:16.6214105Z torch.manual_seed(2025) 2025-05-07T20:32:16.6214177Z 2025-05-07T20:32:16.6214346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6214417Z 2025-05-07T20:32:16.6214510Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6214634Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6214723Z x = x_sign * x_clamp 2025-05-07T20:32:16.6214800Z x0 = x[:, :D] 2025-05-07T20:32:16.6214884Z x1 = x[:, D:] 2025-05-07T20:32:16.6214957Z 2025-05-07T20:32:16.6215038Z if contiguous: 2025-05-07T20:32:16.6215128Z x0 = x0.contiguous() 2025-05-07T20:32:16.6215302Z x1 = x1.contiguous() 2025-05-07T20:32:16.6215374Z 2025-05-07T20:32:16.6215462Z if scale_ub is not None: 2025-05-07T20:32:16.6215572Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6215707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6215781Z ) 2025-05-07T20:32:16.6215858Z else: 2025-05-07T20:32:16.6215948Z scale_ub_tensor = None 2025-05-07T20:32:16.6216019Z 2025-05-07T20:32:16.6216153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6216243Z op = silu_mul_quant 2025-05-07T20:32:16.6216330Z if compiled: 2025-05-07T20:32:16.6216425Z op = torch.compile(op) 2025-05-07T20:32:16.6216528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6216603Z 2025-05-07T20:32:16.6216689Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6216694Z 2025-05-07T20:32:16.6216788Z moe/activation_test.py:117: 2025-05-07T20:32:16.6216925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6217023Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6217120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6217502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6217591Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6218153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6218248Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6218614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6218886Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6219242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6219339Z kernel = self.compile( 2025-05-07T20:32:16.6219732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6219910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6220041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6220047Z 2025-05-07T20:32:16.6220255Z self = 2025-05-07T20:32:16.6221062Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6221589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96382032e0>} 2025-05-07T20:32:16.6222372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6222567Z context = 2025-05-07T20:32:16.6222571Z 2025-05-07T20:32:16.6222737Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6223013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6223119Z module_map=module_map) 2025-05-07T20:32:16.6223279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6223382Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6223455Z E ^ 2025-05-07T20:32:16.6223819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6223900Z 2025-05-07T20:32:16.6224334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6224339Z 2025-05-07T20:32:16.6224438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6224670Z self=, 2025-05-07T20:32:16.6224746Z T=4096, 2025-05-07T20:32:16.6224819Z D=5120, 2025-05-07T20:32:16.6224904Z scale_ub=1200.0, 2025-05-07T20:32:16.6224985Z contiguous=True, 2025-05-07T20:32:16.6225065Z compiled=True, 2025-05-07T20:32:16.6225140Z ) 2025-05-07T20:32:16.6225362Z self = 2025-05-07T20:32:16.6225542Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6225547Z 2025-05-07T20:32:16.6225620Z @given( 2025-05-07T20:32:16.6225738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6225842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6225955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6226070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6226184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6226256Z ) 2025-05-07T20:32:16.6226505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6226644Z def test_silu_mul_quant( 2025-05-07T20:32:16.6226719Z self, 2025-05-07T20:32:16.6226796Z T: int, 2025-05-07T20:32:16.6226872Z D: int, 2025-05-07T20:32:16.6226966Z scale_ub: Optional[float], 2025-05-07T20:32:16.6227053Z contiguous: bool, 2025-05-07T20:32:16.6227179Z compiled: bool, 2025-05-07T20:32:16.6227255Z ) -> None: 2025-05-07T20:32:16.6227350Z torch.manual_seed(2025) 2025-05-07T20:32:16.6227419Z 2025-05-07T20:32:16.6227592Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6227668Z 2025-05-07T20:32:16.6227757Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6227881Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6227972Z x = x_sign * x_clamp 2025-05-07T20:32:16.6228053Z x0 = x[:, :D] 2025-05-07T20:32:16.6228132Z x1 = x[:, D:] 2025-05-07T20:32:16.6228207Z 2025-05-07T20:32:16.6228293Z if contiguous: 2025-05-07T20:32:16.6228385Z x0 = x0.contiguous() 2025-05-07T20:32:16.6228470Z x1 = x1.contiguous() 2025-05-07T20:32:16.6228541Z 2025-05-07T20:32:16.6228631Z if scale_ub is not None: 2025-05-07T20:32:16.6228733Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6228870Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6233629Z ) 2025-05-07T20:32:16.6233724Z else: 2025-05-07T20:32:16.6233822Z scale_ub_tensor = None 2025-05-07T20:32:16.6233898Z 2025-05-07T20:32:16.6234042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6234137Z op = silu_mul_quant 2025-05-07T20:32:16.6234224Z if compiled: 2025-05-07T20:32:16.6234324Z op = torch.compile(op) 2025-05-07T20:32:16.6234432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6234502Z 2025-05-07T20:32:16.6234593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6234601Z 2025-05-07T20:32:16.6234701Z moe/activation_test.py:117: 2025-05-07T20:32:16.6234833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6234935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6235037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6235425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6235520Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6236196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6236295Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6236666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6236891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6237254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6237362Z kernel = self.compile( 2025-05-07T20:32:16.6237781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6237964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6238095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6238100Z 2025-05-07T20:32:16.6238316Z self = 2025-05-07T20:32:16.6239124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6239642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7e7c860>} 2025-05-07T20:32:16.6240463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6240694Z context = 2025-05-07T20:32:16.6240699Z 2025-05-07T20:32:16.6240867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6241143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6241251Z module_map=module_map) 2025-05-07T20:32:16.6241416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6241512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6241591Z E ^ 2025-05-07T20:32:16.6241960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6241970Z 2025-05-07T20:32:16.6242395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6242400Z 2025-05-07T20:32:16.6242503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6242735Z self=, 2025-05-07T20:32:16.6242809Z T=128, 2025-05-07T20:32:16.6242885Z D=5120, 2025-05-07T20:32:16.6242973Z scale_ub=1200.0, 2025-05-07T20:32:16.6243056Z contiguous=False, 2025-05-07T20:32:16.6243136Z compiled=True, 2025-05-07T20:32:16.6243209Z ) 2025-05-07T20:32:16.6243433Z self = 2025-05-07T20:32:16.6243611Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6243615Z 2025-05-07T20:32:16.6243692Z @given( 2025-05-07T20:32:16.6243808Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6243908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6244021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6244135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6244254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6244325Z ) 2025-05-07T20:32:16.6244574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6244751Z def test_silu_mul_quant( 2025-05-07T20:32:16.6244828Z self, 2025-05-07T20:32:16.6244906Z T: int, 2025-05-07T20:32:16.6244981Z D: int, 2025-05-07T20:32:16.6245078Z scale_ub: Optional[float], 2025-05-07T20:32:16.6245174Z contiguous: bool, 2025-05-07T20:32:16.6245258Z compiled: bool, 2025-05-07T20:32:16.6245335Z ) -> None: 2025-05-07T20:32:16.6245435Z torch.manual_seed(2025) 2025-05-07T20:32:16.6245507Z 2025-05-07T20:32:16.6245678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6245754Z 2025-05-07T20:32:16.6245844Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6245967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6246061Z x = x_sign * x_clamp 2025-05-07T20:32:16.6246140Z x0 = x[:, :D] 2025-05-07T20:32:16.6246219Z x1 = x[:, D:] 2025-05-07T20:32:16.6246289Z 2025-05-07T20:32:16.6246370Z if contiguous: 2025-05-07T20:32:16.6246468Z x0 = x0.contiguous() 2025-05-07T20:32:16.6246554Z x1 = x1.contiguous() 2025-05-07T20:32:16.6246623Z 2025-05-07T20:32:16.6246713Z if scale_ub is not None: 2025-05-07T20:32:16.6246816Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6246955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6247033Z ) 2025-05-07T20:32:16.6247172Z else: 2025-05-07T20:32:16.6247274Z scale_ub_tensor = None 2025-05-07T20:32:16.6247364Z 2025-05-07T20:32:16.6247501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6247589Z op = silu_mul_quant 2025-05-07T20:32:16.6247673Z if compiled: 2025-05-07T20:32:16.6247769Z op = torch.compile(op) 2025-05-07T20:32:16.6247919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6247989Z 2025-05-07T20:32:16.6248079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6248083Z 2025-05-07T20:32:16.6248188Z moe/activation_test.py:117: 2025-05-07T20:32:16.6248317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6248416Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6248524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6248902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6249001Z return fn(*args, **kwargs) 
The identical test source and CompilationError traceback repeat for each of the following examples; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
-> CompilationError (same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
-> CompilationError (same fp8e4nv ValueError)
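Any one of these draws can be replayed without the property-based sweep by pinning it with hypothesis's @example decorator on the existing test. A sketch of the decorator stack only, with the body left as the unchanged test method shown above:

from hypothesis import example, given, settings, strategies as st

# @example runs this exact parameterization first, deterministically,
# before hypothesis draws new examples.
@example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(deadline=None)
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...  # unchanged test body (method of the existing test class)

The remaining examples from the sweep: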
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> CompilationError (same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> CompilationError (same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
-> CompilationError (same fp8e4nv ValueError)
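The ValueError itself is architectural: Triton only lowers fp8e4nv (float8_e4m3fn) conversions on compute capability 8.9 or newer, while the A10G in a g5.4xlarge reports capability 8.6, where only fp8e4b15 and fp8e5 are available. A capability guard along the following lines would skip these cases on pre-8.9 runners; the helper name and the class name ActivationTests are assumptions for illustration, not the existing test code.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) conversions require SM 8.9+ (Ada/Hopper);
    # the A10G on g5 runners reports (8, 6) and would take the skip path.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv (e4m3) support")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant and the other fp8 tests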
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (test source as above)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB; 28.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:92, in torch.randn (tried to allocate 448.00 MiB; 140.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB; 28.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:94, in torch.sign (tried to allocate 56.00 MiB; 28.44 MiB free)
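The OutOfMemoryError cases look like a knock-on effect of the sweep rather than an independent bug: each example allocates [T, 2*D] bfloat16 tensors (for T=16384, D=7168 that is exactly the 448 MiB the allocator reports), and by this point the A10G's 22.07 GiB is nearly full. The error text already suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; releasing cached blocks between examples is the other common mitigation. A sketch, assuming the cleanup helper (release_cuda_memory, an illustrative name) is wired into the test's setUp/tearDown:

import gc
import os
import torch

# Must be set before the first CUDA allocation in the process, e.g. at the
# top of the test module or in the CI job environment, as the OOM message
# suggests, to reduce fragmentation-driven failures.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def release_cuda_memory() -> None:
    # Drop dangling Python references, then return cached blocks to the driver.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()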
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6370808Z 2025-05-07T20:32:16.6370931Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.6370936Z 2025-05-07T20:32:16.6371074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6371304Z self=, 2025-05-07T20:32:16.6371379Z T=1, 2025-05-07T20:32:16.6371457Z D=7168, 2025-05-07T20:32:16.6371544Z scale_ub=1200.0, 2025-05-07T20:32:16.6371626Z contiguous=True, 2025-05-07T20:32:16.6371835Z compiled=False, 2025-05-07T20:32:16.6371914Z ) 2025-05-07T20:32:16.6372140Z self = 2025-05-07T20:32:16.6372317Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.6372322Z 2025-05-07T20:32:16.6372401Z @given( 2025-05-07T20:32:16.6372519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6372619Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6372736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6372855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6372974Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6373045Z ) 2025-05-07T20:32:16.6373297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6373396Z def test_silu_mul_quant( 2025-05-07T20:32:16.6373473Z self, 2025-05-07T20:32:16.6373548Z T: int, 2025-05-07T20:32:16.6373630Z D: int, 2025-05-07T20:32:16.6373726Z scale_ub: Optional[float], 2025-05-07T20:32:16.6373814Z contiguous: bool, 2025-05-07T20:32:16.6373906Z compiled: bool, 2025-05-07T20:32:16.6373986Z ) -> None: 2025-05-07T20:32:16.6374080Z torch.manual_seed(2025) 2025-05-07T20:32:16.6374157Z 2025-05-07T20:32:16.6374330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6374408Z 2025-05-07T20:32:16.6374499Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6374624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6374719Z x = x_sign * x_clamp 2025-05-07T20:32:16.6374798Z x0 = x[:, :D] 2025-05-07T20:32:16.6374875Z x1 = x[:, D:] 2025-05-07T20:32:16.6374952Z 2025-05-07T20:32:16.6375034Z if contiguous: 2025-05-07T20:32:16.6375124Z x0 = x0.contiguous() 2025-05-07T20:32:16.6375216Z x1 = x1.contiguous() 2025-05-07T20:32:16.6375290Z 2025-05-07T20:32:16.6375380Z if scale_ub is not None: 2025-05-07T20:32:16.6375489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6375625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6375750Z ) 2025-05-07T20:32:16.6375826Z else: 2025-05-07T20:32:16.6375918Z scale_ub_tensor = None 2025-05-07T20:32:16.6375991Z 2025-05-07T20:32:16.6376124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6376212Z op = silu_mul_quant 2025-05-07T20:32:16.6376300Z if compiled: 2025-05-07T20:32:16.6376402Z op = torch.compile(op) 2025-05-07T20:32:16.6376508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6376582Z 2025-05-07T20:32:16.6376671Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6376675Z 2025-05-07T20:32:16.6376770Z moe/activation_test.py:117: 2025-05-07T20:32:16.6376907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6377008Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6377113Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6377636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6377735Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f94f7b3ab60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = , T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    [the Triton traceback is byte-for-byte identical to the one above; it is elided for the repeated failures below]
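Note on the repeated CompilationError: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner appears to report SM 8.6, so the kernel fails at compile time before any example can run. A minimal, hypothetical guard along the following lines (not what moe/activation_test.py currently does) would skip these examples on unsupported hardware instead of failing the job:

    # Hypothetical guard, assuming torch is importable on the runner; this is
    # a sketch, not part of the FBGEMM test suite.
    import unittest

    import torch


    def has_fp8e4nv_support() -> bool:
        """True when Triton can lower tl.float8e4nv on the current GPU."""
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)  # SM 8.9+: Ada and Hopper


    class Fp8ActivationTest(unittest.TestCase):  # hypothetical test class
        def setUp(self) -> None:
            if not has_fp8e4nv_support():
                self.skipTest("fp8e4nv needs SM 8.9+; this GPU predates it")

With such a guard the run would report skips on pre-Ada runners rather than one CompilationError per Hypothesis example.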
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
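The allocator hint in the message above (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) targets fragmentation, i.e. large reserved-but-unallocated memory; here only 26.44 MiB of 22.07 GiB is free, so the device is genuinely full and the hint alone would likely not rescue these examples. For reference, a sketch of how the hint is applied; the variable is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation (exporting it in the workflow environment works too):

    # Sketch of applying the allocator hint from the error message above.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var on purpose

    x = torch.randn(1024, device="cuda")  # allocator now uses expandable segments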
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported)

The next eleven examples all fail during test setup with torch.OutOfMemoryError; the message matches the one above except for the requested size, the failing line, and the allocator statistics:

Trying example (T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x); 40.00 MiB requested)
Trying example (T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 320.00 MiB requested)
Trying example (T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 80.00 MiB requested)
Trying example (T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 40.00 MiB requested)
Trying example (T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 112.00 MiB requested)
Trying example (T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 40.00 MiB requested)
Trying example (T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 112.00 MiB requested)
Trying example (T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 448.00 MiB requested)
Trying example (T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 112.00 MiB requested)
Trying example (T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 448.00 MiB requested)
Trying example (T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 448.00 MiB requested)
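Across these examples the free memory shrinks from 26.44 MiB toward 4.44 MiB even though each individual request is small (20-448 MiB), which suggests tensors from earlier failed examples stay alive and every new example inherits a nearly full device. A hypothetical per-example cleanup (not present in the test as shown) could keep one OOM from cascading through the rest of the run:

    # Hypothetical cleanup between Hypothesis examples; a sketch only.
    import gc

    import torch


    def release_cuda_memory() -> None:
        gc.collect()              # drop the previous example's x/x0/x1 references
        torch.cuda.synchronize()  # let in-flight kernels finish before freeing
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver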
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same CompilationError at moe/activation_test.py:117 (fp8e4nv not supported)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 56.00 MiB requested)
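For context on what the failing op computes: judging only from the test above (two D-wide halves x0 and x1, an optional scale_ub tensor, and a (y_fp8, y_scale) result), silu_mul_quant fuses a SiLU-gated multiply with FP8 quantization. An unfused PyTorch sketch of that contract, as read from the test rather than from FBGEMM's kernel (the real kernel's scaling granularity and numerics may differ):

    # Reference sketch of the assumed silu_mul_quant contract; not FBGEMM code.
    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the scale
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale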
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
  -> same CompilationError (fp8e4nv not supported): the torch.compile path goes one frame deeper through torch/_dynamo/eval_frame.py but reaches the same Triton kernel.

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; 20.00 MiB requested, 4.44 MiB free)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; 20.00 MiB requested, 4.44 MiB free)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:32:16.6542669Z 2025-05-07T20:32:16.6542885Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:16.6543056Z ================= 1 failed, 1 deselected, 3 warnings in 14.99s ================= 2025-05-07T20:32:18.2737500Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:18.3378220Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:18.3378484Z 2025-05-07T20:32:20.3397261Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:22.4834593Z ============================= test session starts ============================== 2025-05-07T20:32:22.4835854Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:22.4836911Z cachedir: .pytest_cache 2025-05-07T20:32:22.4838080Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:22.4839430Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:22.4839846Z plugins: hypothesis-6.131.14 2025-05-07T20:32:24.0927425Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:24.2000446Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:24.2001554Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:24.2002120Z 2025-05-07T20:32:26.5492578Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5493939Z self=, 2025-05-07T20:32:26.5494793Z T=1, 2025-05-07T20:32:26.5495165Z D=5120, 2025-05-07T20:32:26.5495535Z scale_ub=None, 2025-05-07T20:32:26.5495946Z contiguous=True, 2025-05-07T20:32:26.5496382Z compiled=True, 2025-05-07T20:32:26.5496777Z ) 2025-05-07T20:32:26.5497423Z self = 2025-05-07T20:32:26.5498424Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:26.5498955Z 2025-05-07T20:32:26.5499110Z @given( 2025-05-07T20:32:26.5499970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.5500294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.5500618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.5500952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.5501291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.5501587Z ) 2025-05-07T20:32:26.5501943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.5502396Z def test_silu_mul_quant( 2025-05-07T20:32:26.5502647Z self, 2025-05-07T20:32:26.5502838Z T: int, 2025-05-07T20:32:26.5503042Z D: int, 2025-05-07T20:32:26.5503265Z scale_ub: Optional[float], 2025-05-07T20:32:26.5503540Z contiguous: bool, 2025-05-07T20:32:26.5503787Z compiled: bool, 2025-05-07T20:32:26.5504023Z ) -> None: 2025-05-07T20:32:26.5504239Z torch.manual_seed(2025) 2025-05-07T20:32:26.5504487Z 2025-05-07T20:32:26.5504775Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.5505119Z 2025-05-07T20:32:26.5505419Z x_sign = torch.sign(x) 2025-05-07T20:32:26.5505728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:26.5506049Z x = x_sign * x_clamp 2025-05-07T20:32:26.5506551Z x0 = x[:, :D] 2025-05-07T20:32:26.5506872Z x1 = x[:, D:] 2025-05-07T20:32:26.5507086Z 2025-05-07T20:32:26.5507270Z if contiguous: 2025-05-07T20:32:26.5507508Z x0 = x0.contiguous() 2025-05-07T20:32:26.5507773Z x1 = x1.contiguous() 2025-05-07T20:32:26.5508010Z 2025-05-07T20:32:26.5508206Z if scale_ub is not None: 2025-05-07T20:32:26.5508579Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.5508915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.5509230Z ) 2025-05-07T20:32:26.5509426Z else: 2025-05-07T20:32:26.5509637Z scale_ub_tensor = None 2025-05-07T20:32:26.5509901Z 2025-05-07T20:32:26.5510181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5510504Z op = silu_mul_quant 2025-05-07T20:32:26.5510764Z if compiled: 2025-05-07T20:32:26.5511019Z op = torch.compile(op) 2025-05-07T20:32:26.5511322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.5511600Z 2025-05-07T20:32:26.5511799Z y_fp8, y_scale = fn() 2025-05-07T20:32:26.5512092Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:26.5512383Z 2025-05-07T20:32:26.5512627Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5512973Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:26.5513270Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:26.5513596Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:26.5513966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5514278Z 2025-05-07T20:32:26.5514483Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:26.5514687Z 2025-05-07T20:32:26.5514792Z moe/activation_test.py:126: 2025-05-07T20:32:26.5515099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5515440Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:26.5515779Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5516761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:26.5517542Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:26.5518106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.5518814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.5519613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:26.5520363Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:26.5521164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:26.5521823Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:26.5522446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:26.5522973Z fn() 2025-05-07T20:32:26.5523493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:26.5524095Z self.fn.run( 2025-05-07T20:32:26.5524569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.5525115Z kernel = self.compile( 2025-05-07T20:32:26.5525675Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.5526424Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.5526829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5527071Z 2025-05-07T20:32:26.5527286Z self = 2025-05-07T20:32:26.5528457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.5529948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef86dc60>} 2025-05-07T20:32:26.5531337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.5532471Z context = 2025-05-07T20:32:26.5532779Z 2025-05-07T20:32:26.5532951Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.5533493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.5533970Z module_map=module_map) 2025-05-07T20:32:26.5534347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.5534712Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:26.5534982Z E ^ 2025-05-07T20:32:26.5535463Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5535937Z 2025-05-07T20:32:26.5536370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.5536901Z 2025-05-07T20:32:26.5537016Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5537436Z self=, 2025-05-07T20:32:26.5537849Z T=2048, 2025-05-07T20:32:26.5538040Z D=5120, 2025-05-07T20:32:26.5538233Z scale_ub=1200.0, 2025-05-07T20:32:26.5538458Z contiguous=True, 2025-05-07T20:32:26.5538684Z compiled=False, 2025-05-07T20:32:26.5538887Z ) 2025-05-07T20:32:27.2863286Z self = 2025-05-07T20:32:27.2863882Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:27.2864200Z 2025-05-07T20:32:27.2864282Z @given( 2025-05-07T20:32:27.2864527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2864844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2865448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2865795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2866136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2866438Z ) 2025-05-07T20:32:27.2866806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2867271Z def test_silu_mul_quant( 2025-05-07T20:32:27.2867529Z self, 2025-05-07T20:32:27.2867736Z T: int, 2025-05-07T20:32:27.2867945Z D: int, 2025-05-07T20:32:27.2868170Z scale_ub: Optional[float], 2025-05-07T20:32:27.2868461Z contiguous: bool, 2025-05-07T20:32:27.2868715Z compiled: bool, 2025-05-07T20:32:27.2868950Z ) -> None: 2025-05-07T20:32:27.2869181Z torch.manual_seed(2025) 2025-05-07T20:32:27.2869434Z 2025-05-07T20:32:27.2869713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2870069Z 2025-05-07T20:32:27.2870280Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2870577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2870898Z x = x_sign * x_clamp 2025-05-07T20:32:27.2871250Z x0 = x[:, :D] 
2025-05-07T20:32:27.2871473Z x1 = x[:, D:] 2025-05-07T20:32:27.2871691Z 2025-05-07T20:32:27.2871888Z if contiguous: 2025-05-07T20:32:27.2872133Z x0 = x0.contiguous() 2025-05-07T20:32:27.2872473Z x1 = x1.contiguous() 2025-05-07T20:32:27.2872722Z 2025-05-07T20:32:27.2872923Z if scale_ub is not None: 2025-05-07T20:32:27.2873201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.2873548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.2873989Z ) 2025-05-07T20:32:27.2874192Z else: 2025-05-07T20:32:27.2874415Z scale_ub_tensor = None 2025-05-07T20:32:27.2874683Z 2025-05-07T20:32:27.2874924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.2875253Z op = silu_mul_quant 2025-05-07T20:32:27.2875517Z if compiled: 2025-05-07T20:32:27.2875768Z op = torch.compile(op) 2025-05-07T20:32:27.2876081Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.2876370Z 2025-05-07T20:32:27.2876567Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.2876748Z 2025-05-07T20:32:27.2876858Z moe/activation_test.py:117: 2025-05-07T20:32:27.2877168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2877513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.2877803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.2878529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.2879253Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.2879812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.2880528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.2881225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.2881788Z kernel = self.compile( 2025-05-07T20:32:27.2882351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.2883039Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.2883458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2883695Z 2025-05-07T20:32:27.2883917Z self = 2025-05-07T20:32:27.2885092Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.2886556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef6c8220>} 2025-05-07T20:32:27.2887976Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.2889058Z context = 2025-05-07T20:32:27.2889360Z 2025-05-07T20:32:27.2889540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.2890083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.2890625Z module_map=module_map) 2025-05-07T20:32:27.2891010Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.2899355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.2899637Z E ^ 2025-05-07T20:32:27.2900204Z E ValueError("type fp8e4nv not supported in this architecture. 
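This is the second, dominant failure mode: Triton lowers its fp8e4nv type (FP8 e4m3) to native hardware FP8 only on newer GPUs (compute capability 8.9 Ada / 9.0 Hopper). The GPU on this runner reports a lower capability, so only fp8e4b15 and fp8e5 are accepted, and both the kernel under test (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) fail at compile time. One way to surface this as a skip rather than a hard failure is a capability gate; a sketch under the assumption that torch's reported capability is authoritative (the helper name is illustrative):

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to hardware FP8 (e4m3), which first appears
        # on SM 8.9 (Ada) / SM 9.0 (Hopper). Older parts only expose the
        # fp8e4b15 / fp8e5 variants named in the error above.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )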
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.2900674Z 2025-05-07T20:32:27.2901117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.2901690Z 2025-05-07T20:32:27.2901808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2902234Z self=, 2025-05-07T20:32:27.2902657Z T=2048, 2025-05-07T20:32:27.2902860Z D=5120, 2025-05-07T20:32:27.2903104Z scale_ub=1200.0, 2025-05-07T20:32:27.2903340Z contiguous=True, 2025-05-07T20:32:27.2903576Z compiled=True, 2025-05-07T20:32:27.2903791Z ) 2025-05-07T20:32:27.2904130Z self = 2025-05-07T20:32:27.2904652Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.2904937Z 2025-05-07T20:32:27.2905030Z @given( 2025-05-07T20:32:27.2905276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2905612Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2905933Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2906615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2906962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2907258Z ) 2025-05-07T20:32:27.2907623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2908077Z def test_silu_mul_quant( 2025-05-07T20:32:27.2908337Z self, 2025-05-07T20:32:27.2908544Z T: int, 2025-05-07T20:32:27.2908746Z D: int, 2025-05-07T20:32:27.2908977Z scale_ub: Optional[float], 2025-05-07T20:32:27.2909264Z contiguous: bool, 2025-05-07T20:32:27.2909510Z compiled: bool, 2025-05-07T20:32:27.2909746Z ) -> None: 2025-05-07T20:32:27.2909975Z torch.manual_seed(2025) 2025-05-07T20:32:27.2910249Z 2025-05-07T20:32:27.2910558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2910914Z 2025-05-07T20:32:27.2911112Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2911419Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2911746Z x = x_sign * x_clamp 2025-05-07T20:32:27.2912002Z x0 = x[:, :D] 2025-05-07T20:32:27.2912226Z x1 = x[:, D:] 2025-05-07T20:32:27.2912447Z 2025-05-07T20:32:27.2912643Z if contiguous: 2025-05-07T20:32:27.2912882Z x0 = x0.contiguous() 2025-05-07T20:32:27.2913154Z x1 = x1.contiguous() 2025-05-07T20:32:27.2913405Z 2025-05-07T20:32:27.2913599Z if scale_ub is not None: 2025-05-07T20:32:27.2913903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.2914341Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.2914664Z ) 2025-05-07T20:32:27.2914860Z else: 2025-05-07T20:32:27.2915082Z scale_ub_tensor = None 2025-05-07T20:32:27.2915346Z 2025-05-07T20:32:27.2915584Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.2915913Z op = silu_mul_quant 2025-05-07T20:32:27.2916181Z if compiled: 2025-05-07T20:32:27.2916435Z op = torch.compile(op) 2025-05-07T20:32:27.2916741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.2917029Z 2025-05-07T20:32:27.2917226Z y_fp8, y_scale = fn() 2025-05-07T20:32:27.2917525Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:27.2917832Z 2025-05-07T20:32:27.2918079Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.2918422Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:27.2918732Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:27.2919063Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:27.2919504Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.2919836Z 2025-05-07T20:32:27.2920051Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:27.2920277Z 2025-05-07T20:32:27.2920395Z moe/activation_test.py:126: 2025-05-07T20:32:27.2920777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2921131Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:27.2921476Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.2922297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:27.2923149Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:27.2923728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.2924434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.2925155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:27.2925905Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:27.2926672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:27.2927332Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:27.2927961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:27.2928508Z fn() 2025-05-07T20:32:27.2929036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:27.2929639Z self.fn.run( 2025-05-07T20:32:27.2930134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.2930697Z kernel = self.compile( 2025-05-07T20:32:27.2931268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.2932052Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.2932473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.2932712Z 2025-05-07T20:32:27.2932936Z self = 2025-05-07T20:32:27.2934081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.2935589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef6c96c0>} 2025-05-07T20:32:27.2937024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.2938117Z context = 2025-05-07T20:32:27.2938424Z 2025-05-07T20:32:27.2938605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.2939149Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.2939643Z module_map=module_map) 2025-05-07T20:32:27.2940025Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.2940394Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:27.2940673Z E ^ 2025-05-07T20:32:27.2941165Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.2941642Z 2025-05-07T20:32:27.2942138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.2942685Z 2025-05-07T20:32:27.2942792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2943266Z self=, 2025-05-07T20:32:27.2943691Z T=16384, 2025-05-07T20:32:27.2943886Z D=7168, 2025-05-07T20:32:27.2944089Z scale_ub=1200.0, 2025-05-07T20:32:27.2944324Z contiguous=False, 2025-05-07T20:32:27.2944554Z compiled=False, 2025-05-07T20:32:27.2944812Z ) 2025-05-07T20:32:28.0224034Z self = 2025-05-07T20:32:28.0224641Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.0224942Z 2025-05-07T20:32:28.0225051Z @given( 2025-05-07T20:32:28.0225291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0225604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0225923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0226258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0226598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0226889Z ) 2025-05-07T20:32:28.0227248Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0227707Z def test_silu_mul_quant( 2025-05-07T20:32:28.0227954Z self, 2025-05-07T20:32:28.0228155Z T: int, 2025-05-07T20:32:28.0228361Z D: int, 2025-05-07T20:32:28.0228585Z scale_ub: Optional[float], 2025-05-07T20:32:28.0228866Z contiguous: bool, 2025-05-07T20:32:28.0229117Z compiled: bool, 2025-05-07T20:32:28.0229347Z ) -> None: 2025-05-07T20:32:28.0229572Z torch.manual_seed(2025) 2025-05-07T20:32:28.0229820Z 2025-05-07T20:32:28.0230095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0230450Z 2025-05-07T20:32:28.0230657Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0230950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0231273Z x = x_sign * x_clamp 2025-05-07T20:32:28.0231527Z x0 = x[:, :D] 2025-05-07T20:32:28.0231750Z x1 = x[:, D:] 2025-05-07T20:32:28.0231960Z 2025-05-07T20:32:28.0232153Z if contiguous: 2025-05-07T20:32:28.0232397Z x0 = x0.contiguous() 2025-05-07T20:32:28.0232660Z x1 = x1.contiguous() 2025-05-07T20:32:28.0232908Z 2025-05-07T20:32:28.0233109Z if scale_ub is not None: 2025-05-07T20:32:28.0233387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0233737Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0234059Z ) 2025-05-07T20:32:28.0234512Z else: 2025-05-07T20:32:28.0234737Z scale_ub_tensor = None 2025-05-07T20:32:28.0235000Z 2025-05-07T20:32:28.0235236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0235561Z op = silu_mul_quant 2025-05-07T20:32:28.0235825Z if compiled: 2025-05-07T20:32:28.0236074Z op = torch.compile(op) 2025-05-07T20:32:28.0236386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0236673Z 2025-05-07T20:32:28.0236872Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.0237039Z 2025-05-07T20:32:28.0237143Z moe/activation_test.py:117: 2025-05-07T20:32:28.0237452Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0237801Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.0238084Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0238808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:28.0239527Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.0240190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0240955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0241647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0242277Z kernel = self.compile( 2025-05-07T20:32:28.0242833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0243516Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0244007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0244241Z 2025-05-07T20:32:28.0244461Z self = 2025-05-07T20:32:28.0245592Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0247039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee57ce00>} 2025-05-07T20:32:28.0248442Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0249508Z context = 2025-05-07T20:32:28.0249809Z 2025-05-07T20:32:28.0249986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0250525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0251008Z module_map=module_map) 2025-05-07T20:32:28.0251382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0251744Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.0252074Z E ^ 2025-05-07T20:32:28.0252577Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0253054Z 2025-05-07T20:32:28.0253489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0254024Z 2025-05-07T20:32:28.0254139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0254568Z self=, 2025-05-07T20:32:28.0254987Z T=1, 2025-05-07T20:32:28.0255184Z D=7168, 2025-05-07T20:32:28.0255380Z scale_ub=None, 2025-05-07T20:32:28.0255655Z contiguous=True, 2025-05-07T20:32:28.0255891Z compiled=True, 2025-05-07T20:32:28.0256108Z ) 2025-05-07T20:32:28.0256438Z self = 2025-05-07T20:32:28.0256945Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.0257213Z 2025-05-07T20:32:28.0257301Z @given( 2025-05-07T20:32:28.0257538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0257867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0258185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0258520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0258859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0259157Z ) 2025-05-07T20:32:28.0259520Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0259973Z def test_silu_mul_quant( 2025-05-07T20:32:28.0260231Z self, 2025-05-07T20:32:28.0260440Z T: int, 2025-05-07T20:32:28.0260639Z D: int, 2025-05-07T20:32:28.0260864Z scale_ub: Optional[float], 2025-05-07T20:32:28.0261192Z contiguous: bool, 2025-05-07T20:32:28.0261437Z compiled: bool, 2025-05-07T20:32:28.0261667Z ) -> None: 2025-05-07T20:32:28.0261890Z torch.manual_seed(2025) 2025-05-07T20:32:28.0262133Z 2025-05-07T20:32:28.0262457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0262811Z 2025-05-07T20:32:28.0263008Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0263312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0263631Z x = x_sign * x_clamp 2025-05-07T20:32:28.0263875Z x0 = x[:, :D] 2025-05-07T20:32:28.0264143Z x1 = x[:, D:] 2025-05-07T20:32:28.0264357Z 2025-05-07T20:32:28.0264545Z if contiguous: 2025-05-07T20:32:28.0264790Z x0 = x0.contiguous() 2025-05-07T20:32:28.0265060Z x1 = x1.contiguous() 2025-05-07T20:32:28.0265308Z 2025-05-07T20:32:28.0265506Z if scale_ub is not None: 2025-05-07T20:32:28.0265788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0266133Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0266446Z ) 2025-05-07T20:32:28.0266649Z else: 2025-05-07T20:32:28.0266868Z scale_ub_tensor = None 2025-05-07T20:32:28.0267125Z 2025-05-07T20:32:28.0267366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0267695Z op = silu_mul_quant 2025-05-07T20:32:28.0267950Z if compiled: 2025-05-07T20:32:28.0268212Z op = torch.compile(op) 2025-05-07T20:32:28.0268524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0268805Z 2025-05-07T20:32:28.0269007Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.0269308Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.0269604Z 2025-05-07T20:32:28.0269854Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0270202Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.0270524Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.0270881Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.0271258Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.0271582Z 2025-05-07T20:32:28.0271788Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.0271996Z 2025-05-07T20:32:28.0272098Z moe/activation_test.py:126: 2025-05-07T20:32:28.0272415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0272757Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.0273102Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.0273974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.0274758Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.0275319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0276029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0276749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.0277500Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.0278255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.0278924Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.0279550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.0280082Z fn() 2025-05-07T20:32:28.0280615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.0281272Z self.fn.run( 2025-05-07T20:32:28.0281804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0282354Z kernel = self.compile( 2025-05-07T20:32:28.0282913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0283632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0284041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0284285Z 2025-05-07T20:32:28.0284541Z self = 2025-05-07T20:32:28.0285676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0287111Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee5ab600>} 2025-05-07T20:32:28.0288519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0289587Z context = 2025-05-07T20:32:28.0289894Z 2025-05-07T20:32:28.0290064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0290615Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0291098Z module_map=module_map) 2025-05-07T20:32:28.0291469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0291899Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.0292179Z E ^ 2025-05-07T20:32:28.0292661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0293141Z 2025-05-07T20:32:28.0293574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0294116Z 2025-05-07T20:32:28.0294226Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0294654Z self=, 2025-05-07T20:32:28.0295067Z T=4096, 2025-05-07T20:32:28.0295263Z D=5120, 2025-05-07T20:32:28.0295463Z scale_ub=None, 2025-05-07T20:32:28.0295684Z contiguous=False, 2025-05-07T20:32:28.0295920Z compiled=False, 2025-05-07T20:32:28.0296132Z ) 2025-05-07T20:32:28.8252612Z self = 2025-05-07T20:32:28.8253178Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.8253492Z 2025-05-07T20:32:28.8253575Z @given( 2025-05-07T20:32:28.8253810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8254126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8254430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8254766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8255095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8255386Z ) 2025-05-07T20:32:28.8255749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8256209Z def test_silu_mul_quant( 2025-05-07T20:32:28.8256454Z self, 2025-05-07T20:32:28.8256660Z T: int, 2025-05-07T20:32:28.8256865Z D: int, 2025-05-07T20:32:28.8257089Z scale_ub: Optional[float], 2025-05-07T20:32:28.8257372Z contiguous: bool, 2025-05-07T20:32:28.8257625Z compiled: bool, 2025-05-07T20:32:28.8257854Z ) -> None: 2025-05-07T20:32:28.8258156Z torch.manual_seed(2025) 2025-05-07T20:32:28.8258409Z 2025-05-07T20:32:28.8258692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8259037Z 2025-05-07T20:32:28.8259240Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8259609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8259927Z x = x_sign * x_clamp 2025-05-07T20:32:28.8260175Z x0 = x[:, :D] 2025-05-07T20:32:28.8260398Z x1 = x[:, D:] 2025-05-07T20:32:28.8260606Z 2025-05-07T20:32:28.8260798Z if contiguous: 2025-05-07T20:32:28.8261137Z x0 = x0.contiguous() 2025-05-07T20:32:28.8261399Z x1 = x1.contiguous() 2025-05-07T20:32:28.8261648Z 2025-05-07T20:32:28.8261848Z if scale_ub is not None: 2025-05-07T20:32:28.8262163Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8262513Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8262826Z ) 2025-05-07T20:32:28.8263031Z else: 2025-05-07T20:32:28.8263248Z scale_ub_tensor = None 2025-05-07T20:32:28.8263501Z 2025-05-07T20:32:28.8263742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8264075Z op = silu_mul_quant 2025-05-07T20:32:28.8264333Z if compiled: 2025-05-07T20:32:28.8264591Z op = torch.compile(op) 2025-05-07T20:32:28.8264899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8265177Z 2025-05-07T20:32:28.8265381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.8265561Z 2025-05-07T20:32:28.8265664Z moe/activation_test.py:117: 2025-05-07T20:32:28.8265973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8266311Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.8266607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8267333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.8268048Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.8268609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8269323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8270016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8270569Z kernel = self.compile( 2025-05-07T20:32:28.8271132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8271826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8272282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8272523Z 2025-05-07T20:32:28.8272737Z self = 2025-05-07T20:32:28.8273867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8275300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee5ab420>} 2025-05-07T20:32:28.8276702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8277762Z context = 2025-05-07T20:32:28.8278070Z 2025-05-07T20:32:28.8278240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8278829Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8279316Z module_map=module_map) 2025-05-07T20:32:28.8279684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8280087Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.8280352Z E ^ 2025-05-07T20:32:28.8280826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8281298Z 2025-05-07T20:32:28.8281726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8282309Z 2025-05-07T20:32:28.8282417Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8282852Z self=, 2025-05-07T20:32:28.8283271Z T=4096, 2025-05-07T20:32:28.8283467Z D=7168, 2025-05-07T20:32:28.8291393Z scale_ub=None, 2025-05-07T20:32:28.8291651Z contiguous=False, 2025-05-07T20:32:28.8291978Z compiled=False, 2025-05-07T20:32:28.8292192Z ) 2025-05-07T20:32:28.8292518Z self = 2025-05-07T20:32:28.8293044Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.8293332Z 2025-05-07T20:32:28.8293422Z @given( 2025-05-07T20:32:28.8293662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8293991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8294313Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8294662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8294998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8295297Z ) 2025-05-07T20:32:28.8295666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8296118Z def test_silu_mul_quant( 2025-05-07T20:32:28.8296373Z self, 2025-05-07T20:32:28.8296583Z T: int, 2025-05-07T20:32:28.8296784Z D: int, 2025-05-07T20:32:28.8297012Z scale_ub: Optional[float], 2025-05-07T20:32:28.8297289Z contiguous: bool, 2025-05-07T20:32:28.8297537Z compiled: bool, 2025-05-07T20:32:28.8297756Z ) -> None: 2025-05-07T20:32:28.8297976Z torch.manual_seed(2025) 2025-05-07T20:32:28.8298224Z 2025-05-07T20:32:28.8298497Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8298850Z 2025-05-07T20:32:28.8299051Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8299351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8299668Z x = x_sign * x_clamp 2025-05-07T20:32:28.8299919Z x0 = x[:, :D] 2025-05-07T20:32:28.8300133Z x1 = x[:, D:] 2025-05-07T20:32:28.8300421Z 2025-05-07T20:32:28.8300616Z if contiguous: 2025-05-07T20:32:28.8300850Z x0 = x0.contiguous() 2025-05-07T20:32:28.8301126Z x1 = x1.contiguous() 2025-05-07T20:32:28.8301372Z 2025-05-07T20:32:28.8301562Z if scale_ub is not None: 2025-05-07T20:32:28.8301844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8302187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8302506Z ) 2025-05-07T20:32:28.8302696Z else: 2025-05-07T20:32:28.8302911Z scale_ub_tensor = None 2025-05-07T20:32:28.8303165Z 2025-05-07T20:32:28.8303399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8303722Z op = silu_mul_quant 2025-05-07T20:32:28.8303981Z if compiled: 2025-05-07T20:32:28.8304226Z op = torch.compile(op) 2025-05-07T20:32:28.8304530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8304812Z 2025-05-07T20:32:28.8305005Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.8305180Z 2025-05-07T20:32:28.8305281Z moe/activation_test.py:117: 2025-05-07T20:32:28.8305640Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8305980Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.8306513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8307317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.8308030Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.8308577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8309348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8310037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8310592Z kernel = self.compile( 2025-05-07T20:32:28.8311145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8311823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8312237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8312478Z 2025-05-07T20:32:28.8312694Z self = 2025-05-07T20:32:28.8313818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8315246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee58c360>} 2025-05-07T20:32:28.8316638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8317702Z context = 2025-05-07T20:32:28.8317999Z 2025-05-07T20:32:28.8318172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8318715Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8319198Z module_map=module_map) 2025-05-07T20:32:28.8319566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8319936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.8320205Z E ^ 2025-05-07T20:32:28.8320688Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8321227Z 2025-05-07T20:32:28.8321657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8322193Z 2025-05-07T20:32:28.8322301Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8322730Z self=, 2025-05-07T20:32:28.8323149Z T=128, 2025-05-07T20:32:28.8323338Z D=7168, 2025-05-07T20:32:28.8323540Z scale_ub=None, 2025-05-07T20:32:28.8323761Z contiguous=False, 2025-05-07T20:32:28.8323988Z compiled=True, 2025-05-07T20:32:28.8324197Z ) 2025-05-07T20:32:28.8863434Z self = 2025-05-07T20:32:28.8863987Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.8864272Z 2025-05-07T20:32:28.8864350Z @given( 2025-05-07T20:32:28.8864585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8864909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8865218Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8865555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8865994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8866286Z ) 2025-05-07T20:32:28.8866642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8867160Z def test_silu_mul_quant( 2025-05-07T20:32:28.8867416Z self, 2025-05-07T20:32:28.8867618Z T: int, 2025-05-07T20:32:28.8867828Z D: int, 2025-05-07T20:32:28.8868060Z scale_ub: Optional[float], 2025-05-07T20:32:28.8868332Z contiguous: bool, 2025-05-07T20:32:28.8868578Z compiled: bool, 2025-05-07T20:32:28.8868881Z ) -> None: 2025-05-07T20:32:28.8869095Z torch.manual_seed(2025) 2025-05-07T20:32:28.8869345Z 2025-05-07T20:32:28.8869623Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8869968Z 2025-05-07T20:32:28.8870170Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8870469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8870790Z x = x_sign * x_clamp 2025-05-07T20:32:28.8871034Z x0 = x[:, :D] 2025-05-07T20:32:28.8871257Z x1 = x[:, D:] 2025-05-07T20:32:28.8871470Z 2025-05-07T20:32:28.8871656Z if contiguous: 2025-05-07T20:32:28.8871898Z x0 = x0.contiguous() 2025-05-07T20:32:28.8872172Z x1 = x1.contiguous() 2025-05-07T20:32:28.8872418Z 2025-05-07T20:32:28.8872614Z if scale_ub is not None: 2025-05-07T20:32:28.8872894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8873228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8873545Z ) 2025-05-07T20:32:28.8873741Z else: 2025-05-07T20:32:28.8873948Z scale_ub_tensor = None 2025-05-07T20:32:28.8874285Z 2025-05-07T20:32:28.8874576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8874903Z op = silu_mul_quant 2025-05-07T20:32:28.8875164Z if compiled: 2025-05-07T20:32:28.8875423Z op = torch.compile(op) 2025-05-07T20:32:28.8875717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8876003Z 2025-05-07T20:32:28.8876205Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.8876509Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.8876805Z 2025-05-07T20:32:28.8877048Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8877388Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.8877684Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.8878007Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.8878374Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8878688Z 2025-05-07T20:32:28.8878980Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.8879182Z 2025-05-07T20:32:28.8879289Z moe/activation_test.py:126: 2025-05-07T20:32:28.8879594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8879933Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.8880266Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8881127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.8881905Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.8882464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8883171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8883882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.8884624Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.8885425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.8886084Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.8886704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.8887274Z fn() 2025-05-07T20:32:28.8887797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.8888399Z self.fn.run( 2025-05-07T20:32:28.8888875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8889462Z kernel = self.compile( 2025-05-07T20:32:28.8890020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8890693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8891100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8891339Z 2025-05-07T20:32:28.8891550Z self = 2025-05-07T20:32:28.8892740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8894171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5e7a0>} 2025-05-07T20:32:28.8895557Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8896624Z context = 2025-05-07T20:32:28.8896928Z 2025-05-07T20:32:28.8897099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8897637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8898119Z module_map=module_map) 2025-05-07T20:32:28.8898496Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8898861Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.8899128Z E ^ 2025-05-07T20:32:28.8899609Z E ValueError("type fp8e4nv not supported in this architecture. 
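Note that the reference path fails for the same reason as the kernel under test: ref_fn() calls triton_quantize_fp8_row, which is itself a Triton kernel targeting fp8e4nv. If the reference were meant to be independent of Triton, a torch-only rowwise quantizer could stand in. The sketch below is illustrative only, assumes the installed PyTorch exposes torch.float8_e4m3fn, and is not FBGEMM's implementation:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Rowwise symmetric quantization into float8_e4m3fn. Returns the fp8
        # tensor and the per-row dequantization scale, matching how the test
        # dequantizes: y_fp8.to(torch.float32) * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Assumed semantics: scale_ub bounds the per-row max.
            row_max = row_max.clamp(max=scale_ub)
        scale = row_max / fp8_max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)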
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8900087Z 2025-05-07T20:32:28.8900515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8901092Z 2025-05-07T20:32:28.8901204Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8901630Z self=, 2025-05-07T20:32:28.8902044Z T=128, 2025-05-07T20:32:28.8902237Z D=7168, 2025-05-07T20:32:28.8902430Z scale_ub=None, 2025-05-07T20:32:28.8902651Z contiguous=False, 2025-05-07T20:32:28.8902886Z compiled=False, 2025-05-07T20:32:28.8903088Z ) 2025-05-07T20:32:29.0847392Z self = 2025-05-07T20:32:29.0848505Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.0849076Z 2025-05-07T20:32:29.0849239Z @given( 2025-05-07T20:32:29.0849723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.0850344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.0850849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.0851236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.0851568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.0851922Z ) 2025-05-07T20:32:29.0852394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.0852847Z def test_silu_mul_quant( 2025-05-07T20:32:29.0853101Z self, 2025-05-07T20:32:29.0853304Z T: int, 2025-05-07T20:32:29.0853593Z D: int, 2025-05-07T20:32:29.0853817Z scale_ub: Optional[float], 2025-05-07T20:32:29.0854095Z contiguous: bool, 2025-05-07T20:32:29.0854342Z compiled: bool, 2025-05-07T20:32:29.0854572Z ) -> None: 2025-05-07T20:32:29.0854793Z torch.manual_seed(2025) 2025-05-07T20:32:29.0855041Z 2025-05-07T20:32:29.0855385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.0855735Z 2025-05-07T20:32:29.0855940Z x_sign = torch.sign(x) 2025-05-07T20:32:29.0856238Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.0856558Z x = x_sign * x_clamp 2025-05-07T20:32:29.0856804Z x0 = x[:, :D] 2025-05-07T20:32:29.0857021Z x1 = x[:, D:] 2025-05-07T20:32:29.0857240Z 2025-05-07T20:32:29.0857432Z if contiguous: 2025-05-07T20:32:29.0857667Z x0 = x0.contiguous() 2025-05-07T20:32:29.0857933Z x1 = x1.contiguous() 2025-05-07T20:32:29.0858185Z 2025-05-07T20:32:29.0858376Z if scale_ub is not None: 2025-05-07T20:32:29.0858654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.0858994Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.0859317Z ) 2025-05-07T20:32:29.0859513Z else: 2025-05-07T20:32:29.0859734Z scale_ub_tensor = None 2025-05-07T20:32:29.0859991Z 2025-05-07T20:32:29.0860224Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.0860554Z op = silu_mul_quant 2025-05-07T20:32:29.0860813Z if compiled: 2025-05-07T20:32:29.0861067Z op = torch.compile(op) 2025-05-07T20:32:29.0861375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0861658Z 2025-05-07T20:32:29.0861855Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.0862028Z 2025-05-07T20:32:29.0862135Z moe/activation_test.py:117: 2025-05-07T20:32:29.0862440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0862781Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.0863072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0863786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.0864505Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.0865054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.0865835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.0866524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.0867079Z kernel = self.compile( 2025-05-07T20:32:29.0867632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.0868311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.0868731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0868971Z 2025-05-07T20:32:29.0869182Z self = 2025-05-07T20:32:29.0870308Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.0871932Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5e980>} 2025-05-07T20:32:29.0873381Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.0874486Z context = 2025-05-07T20:32:29.0874784Z 2025-05-07T20:32:29.0874954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.0875493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.0876016Z module_map=module_map) 2025-05-07T20:32:29.0876391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.0876749Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.0877016Z E ^ 2025-05-07T20:32:29.0877494Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.0877962Z 2025-05-07T20:32:29.0878391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.0878930Z 2025-05-07T20:32:29.0879037Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.0879465Z self=, 2025-05-07T20:32:29.0879880Z T=4096, 2025-05-07T20:32:29.0880071Z D=5120, 2025-05-07T20:32:29.0880270Z scale_ub=1200.0, 2025-05-07T20:32:29.0880499Z contiguous=True, 2025-05-07T20:32:29.0880724Z compiled=False, 2025-05-07T20:32:29.0880969Z ) 2025-05-07T20:32:29.0881325Z self = 2025-05-07T20:32:29.0881834Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.0882126Z 2025-05-07T20:32:29.0882205Z @given( 2025-05-07T20:32:29.0882446Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.0882766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.0883079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.0883418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.0883758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.0884046Z ) 2025-05-07T20:32:29.0884409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.0884865Z def test_silu_mul_quant( 2025-05-07T20:32:29.0885111Z self, 2025-05-07T20:32:29.0885317Z T: int, 2025-05-07T20:32:29.0885524Z D: int, 2025-05-07T20:32:29.0885744Z scale_ub: Optional[float], 2025-05-07T20:32:29.0886022Z contiguous: bool, 2025-05-07T20:32:29.0886271Z compiled: bool, 2025-05-07T20:32:29.0886495Z ) -> None: 2025-05-07T20:32:29.0886770Z torch.manual_seed(2025) 2025-05-07T20:32:29.0887022Z 2025-05-07T20:32:29.0887299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.0887652Z 2025-05-07T20:32:29.0887850Z x_sign = torch.sign(x) 2025-05-07T20:32:29.0888144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.0888468Z x = x_sign * x_clamp 2025-05-07T20:32:29.0888716Z x0 = x[:, :D] 2025-05-07T20:32:29.0888944Z x1 = x[:, D:] 2025-05-07T20:32:29.0889152Z 2025-05-07T20:32:29.0889342Z if contiguous: 2025-05-07T20:32:29.0889584Z x0 = x0.contiguous() 2025-05-07T20:32:29.0889846Z x1 = x1.contiguous() 2025-05-07T20:32:29.0890100Z 2025-05-07T20:32:29.0890294Z if scale_ub is not None: 2025-05-07T20:32:29.0890571Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.0890911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.0891230Z ) 2025-05-07T20:32:29.0891422Z else: 2025-05-07T20:32:29.0891636Z scale_ub_tensor = None 2025-05-07T20:32:29.0891958Z 2025-05-07T20:32:29.0892242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.0892564Z op = silu_mul_quant 2025-05-07T20:32:29.0892826Z if compiled: 2025-05-07T20:32:29.0893076Z op = torch.compile(op) 2025-05-07T20:32:29.0893420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0893699Z 2025-05-07T20:32:29.0893896Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.0894064Z 2025-05-07T20:32:29.0894165Z moe/activation_test.py:117: 2025-05-07T20:32:29.0894469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0894852Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.0895135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.0895843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.0896555Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.0897108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.0897809Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.0898494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.0899042Z kernel = self.compile( 2025-05-07T20:32:29.0899596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.0900271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.0900679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.0900915Z 2025-05-07T20:32:29.0901133Z self = 2025-05-07T20:32:29.0902249Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.0903664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5f9c0>} 2025-05-07T20:32:29.0905072Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.0906344Z context = 2025-05-07T20:32:29.0906681Z 2025-05-07T20:32:29.0906858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.0907489Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.0907974Z module_map=module_map) 2025-05-07T20:32:29.0908348Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.0908710Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.0908979Z E ^ 2025-05-07T20:32:29.0909461Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)

Each of these four examples fails at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, with the same test source and traceback as the _kernel_quantize_fp8_row failure shown above, ending in the identical error:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.7188093Z 2025-05-07T20:32:30.7188576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.7189122Z 2025-05-07T20:32:30.7189230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.7189669Z self=, 2025-05-07T20:32:30.7190097Z T=16384, 2025-05-07T20:32:30.7190298Z D=5120, 2025-05-07T20:32:30.7190505Z scale_ub=None, 2025-05-07T20:32:30.7190751Z contiguous=True, 2025-05-07T20:32:30.7190992Z compiled=True, 2025-05-07T20:32:30.7191207Z ) 2025-05-07T20:32:30.7454999Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:30.7456402Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:30.7457801Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:30.7459040Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:30.7460192Z W0507 20:32:30.742000 97758 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:30.8327736Z self = 2025-05-07T20:32:30.8328976Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.8329546Z 2025-05-07T20:32:30.8330061Z @given( 2025-05-07T20:32:30.8330526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.8331164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.8331526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.8331925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.8332268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.8332573Z ) 2025-05-07T20:32:30.8332931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.8333394Z def test_silu_mul_quant( 2025-05-07T20:32:30.8333654Z self, 2025-05-07T20:32:30.8333862Z T: int, 2025-05-07T20:32:30.8334075Z D: int, 2025-05-07T20:32:30.8334308Z scale_ub: Optional[float], 2025-05-07T20:32:30.8334599Z contiguous: bool, 2025-05-07T20:32:30.8334847Z compiled: bool, 2025-05-07T20:32:30.8335092Z ) -> None: 2025-05-07T20:32:30.8335321Z torch.manual_seed(2025) 2025-05-07T20:32:30.8335573Z 2025-05-07T20:32:30.8335860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.8336220Z 2025-05-07T20:32:30.8336419Z x_sign = torch.sign(x) 2025-05-07T20:32:30.8336728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.8337054Z x = x_sign * x_clamp 2025-05-07T20:32:30.8337299Z x0 = x[:, :D] 2025-05-07T20:32:30.8337532Z x1 = x[:, D:] 2025-05-07T20:32:30.8337753Z 2025-05-07T20:32:30.8337944Z if contiguous: 2025-05-07T20:32:30.8338193Z x0 = x0.contiguous() 2025-05-07T20:32:30.8338466Z x1 = x1.contiguous() 2025-05-07T20:32:30.8338713Z 2025-05-07T20:32:30.8338917Z if scale_ub is not None: 2025-05-07T20:32:30.8339207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.8339547Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:30.8339870Z ) 2025-05-07T20:32:30.8340079Z else: 2025-05-07T20:32:30.8340303Z scale_ub_tensor = None 2025-05-07T20:32:30.8340558Z 2025-05-07T20:32:30.8340805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8341234Z op = silu_mul_quant 2025-05-07T20:32:30.8341536Z if compiled: 2025-05-07T20:32:30.8341806Z op = torch.compile(op) 2025-05-07T20:32:30.8342123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8342404Z 2025-05-07T20:32:30.8342613Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.8342916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.8343218Z 2025-05-07T20:32:30.8343472Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8343824Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.8344125Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.8344458Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.8344837Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.8345162Z 2025-05-07T20:32:30.8345370Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:30.8345582Z 2025-05-07T20:32:30.8345687Z moe/activation_test.py:126: 2025-05-07T20:32:30.8346004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8346351Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.8346782Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.8347610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.8348471Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.8349034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8349745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8350507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.8351255Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.8352018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.8352693Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.8353322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.8353857Z fn() 2025-05-07T20:32:30.8354394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.8355013Z self.fn.run( 2025-05-07T20:32:30.8355494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8356054Z kernel = self.compile( 2025-05-07T20:32:30.8356618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8357304Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8357712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8357959Z 2025-05-07T20:32:30.8358180Z self = 2025-05-07T20:32:30.8359311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8360762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac87a6d40>} 2025-05-07T20:32:30.8362165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8363274Z context = 2025-05-07T20:32:30.8363581Z 2025-05-07T20:32:30.8363753Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8364299Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8364778Z module_map=module_map) 2025-05-07T20:32:30.8365162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8365540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.8365822Z E ^ 2025-05-07T20:32:30.8366303Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8366779Z 2025-05-07T20:32:30.8367210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8367744Z 2025-05-07T20:32:30.8367862Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8368301Z self=, 2025-05-07T20:32:30.8368718Z T=1, 2025-05-07T20:32:30.8368917Z D=5120, 2025-05-07T20:32:30.8369175Z scale_ub=1200.0, 2025-05-07T20:32:30.8369411Z contiguous=True, 2025-05-07T20:32:30.8369650Z compiled=True, 2025-05-07T20:32:30.8369873Z ) 2025-05-07T20:32:30.9806808Z self = 2025-05-07T20:32:30.9807483Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:30.9807861Z 2025-05-07T20:32:30.9807966Z @given( 2025-05-07T20:32:30.9808199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9808512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9808923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9809256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9809588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9809889Z ) 2025-05-07T20:32:30.9810247Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9810699Z def test_silu_mul_quant( 2025-05-07T20:32:30.9810948Z self, 2025-05-07T20:32:30.9811146Z T: int, 2025-05-07T20:32:30.9811343Z D: int, 2025-05-07T20:32:30.9811590Z scale_ub: Optional[float], 2025-05-07T20:32:30.9811950Z contiguous: bool, 2025-05-07T20:32:30.9812190Z compiled: bool, 2025-05-07T20:32:30.9812422Z ) -> None: 2025-05-07T20:32:30.9812640Z torch.manual_seed(2025) 2025-05-07T20:32:30.9812887Z 2025-05-07T20:32:30.9813160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9813512Z 2025-05-07T20:32:30.9813710Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9814003Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9814317Z x = x_sign * x_clamp 2025-05-07T20:32:30.9814563Z x0 = x[:, :D] 2025-05-07T20:32:30.9814784Z x1 = x[:, D:] 2025-05-07T20:32:30.9814995Z 2025-05-07T20:32:30.9815183Z if contiguous: 2025-05-07T20:32:30.9815419Z x0 = x0.contiguous() 2025-05-07T20:32:30.9815681Z x1 = x1.contiguous() 2025-05-07T20:32:30.9815925Z 2025-05-07T20:32:30.9816113Z if scale_ub is not None: 2025-05-07T20:32:30.9816391Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9816736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9817047Z ) 2025-05-07T20:32:30.9817241Z else: 2025-05-07T20:32:30.9817455Z scale_ub_tensor = 
None 2025-05-07T20:32:30.9817707Z 2025-05-07T20:32:30.9817943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9818269Z op = silu_mul_quant 2025-05-07T20:32:30.9818525Z if compiled: 2025-05-07T20:32:30.9818773Z op = torch.compile(op) 2025-05-07T20:32:30.9819167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9819450Z 2025-05-07T20:32:30.9819645Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.9819817Z 2025-05-07T20:32:30.9819920Z moe/activation_test.py:117: 2025-05-07T20:32:30.9820226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9820561Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.9820847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9821459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.9822047Z return fn(*args, **kwargs) 2025-05-07T20:32:30.9822722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.9823433Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9823987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9824686Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9825436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9825987Z kernel = self.compile( 2025-05-07T20:32:30.9826540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9827256Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9827666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9827902Z 2025-05-07T20:32:30.9828119Z self = 2025-05-07T20:32:30.9829284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9830708Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac82a19e0>} 2025-05-07T20:32:30.9832119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9833184Z context = 2025-05-07T20:32:30.9833482Z 2025-05-07T20:32:30.9833657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9834196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9834675Z module_map=module_map) 2025-05-07T20:32:30.9835044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9835409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9835675Z E ^ 2025-05-07T20:32:30.9836152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)

Fails at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), via triton_quantize_fp8_row -> _kernel_quantize_fp8_row, with the same traceback and CompilationError as the examples above.

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)

Fails at y_fp8, y_scale = fn() (moe/activation_test.py:117), via silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.3613030Z 2025-05-07T20:32:31.3613458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Hypothesis went on drawing examples; each of the following failed at moe/activation_test.py:117 with the identical CompilationError while compiling _fbgemm_silu_mul_quant (test source and traceback are the same as in the block above, so only the drawn parameters are listed):
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
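All of these failures share one root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn and, in this Triton build, is only supported on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G, which is compute capability 8.6; there Triton exposes only fp8e4b15 and fp8e5, exactly what the ValueError lists. A minimal sketch that reproduces the error outside the test suite, assuming a pre-SM-8.9 CUDA GPU (the kernel below is illustrative, not code from fbgemm_gpu):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N: tl.constexpr):
    # The fp8e4nv cast/store is what trips the architecture check in
    # ast_to_ttir when the kernel is lowered on SM < 8.9 devices.
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv))

x = torch.randn(16, device="cuda", dtype=torch.float32)
y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (SM 8.6) this raises triton.compiler.errors.CompilationError
# wrapping: ValueError("type fp8e4nv not supported in this architecture. ...")
_cast_to_fp8e4nv[(1,)](x, y, N=16)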
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
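Both kernels seen failing in this log cast to the same type: the fused _fbgemm_silu_mul_quant and, on the reference path, _kernel_quantize_fp8_row reached through triton_quantize_fp8_row. For orientation, a torch-only sketch of the row-wise quantization the reference path computes; the 448.0 bound is torch.finfo(torch.float8_e4m3fn).max, and the helper is an illustrative approximation, not fbgemm_gpu's implementation:

from typing import Optional, Tuple
import torch

FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def quantize_fp8_row_eager(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clamped from above by scale_ub.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # Choose a per-row scale so each row fits the fp8e4nv range; the test
    # dequantizes with y_fp8.to(torch.float32) * y_scale[:, None].
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale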
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> here the fused kernel call went through and the failure moved to the reference path: ref_fn() at moe/activation_test.py:126 raised the same CompilationError from _kernel_quantize_fp8_row (via triton_quantize_fp8_row in fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.1396464Z 
2025-05-07T20:32:32.1396898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.1397432Z 
Hypothesis then tried ten more examples. Every one of them ran the identical test body, reached the same kernel launch (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, silu_mul_quant -> _fbgemm_silu_mul_quant[grid]), and failed with the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters differ:
2025-05-07T20:32:32.1397540Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:32.2578448Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.2610628Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.2642180Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.4425384Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:32.7536645Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.7569070Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:32.8742449Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.8795008Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:32.8856643Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
The failure is insensitive to T, D, scale_ub, contiguity, and compile mode: with compiled=True the call goes through torch/_dynamo/eval_frame.py before reaching the kernel, with compiled=False it calls the kernel directly, and both paths die in the same Triton compile step.
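For context on what the failing op computes: silu_mul_quant fuses a SwiGLU-style activation (SiLU(x0) * x1) with rowwise FP8 quantization, returning the quantized tensor and its per-row scales. A minimal eager-mode sketch, inferred from the test's inputs and outputs — the rowwise scaling scheme, the helper name silu_mul_quant_ref, and the 448.0 e4m3 max constant are assumptions for illustration, not FBGEMM's actual implementation:

    import torch

    FP8_MAX = 448.0  # assumed: largest finite value of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,  # [T, D] bf16 gate branch (x[:, :D] in the test)
        x1: torch.Tensor,  # [T, D] bf16 linear branch (x[:, D:] in the test)
        scale_ub: torch.Tensor | None = None,  # optional [1] fp32 cap on row max
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then per-row quantization to FP8 e4m3
        # (the "fp8e4nv" type that Triton refused to emit above).
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX  # per-row dequantization scale
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)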
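The error does not depend on FBGEMM at all; any Triton kernel that touches the fp8e4nv type should hit the same architecture check on this runner. A hedged standalone reproducer (assuming the check fires on any conversion to tl.float8e4nv, which matches where the traceback above stops — inside ast_to_ttir):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _probe_fp8e4nv(y_ptr):
        # Converting any value to fp8e4nv is enough to trigger the
        # per-architecture dtype check during IR generation.
        v = tl.full((1,), 0.0, dtype=tl.float32).to(tl.float8e4nv)
        tl.store(y_ptr + tl.arange(0, 1), v.to(tl.float32))

    y = torch.zeros(1, device="cuda")
    # On SM < 8.9 this should raise the same CompilationError as the log above.
    _probe_fp8e4nv[(1,)](y)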
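Root cause: Triton only lowers fp8e4nv (torch.float8_e4m3fn) on GPUs with compute capability (8, 9) or newer (Ada/Hopper); the A10G in a g5.4xlarge reports (8, 6), which is why only 'fp8e4b15' and 'fp8e5' are offered. Rather than erroring through every Hypothesis example, the suite could skip on such hardware — a hypothetical guard, not code from activation_test.py:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv needs SM 8.9+; the A10G on this runner is SM 8.6.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipUnless(cuda_supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
    class Fp8MoEActivationTests(unittest.TestCase):
        ...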
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1281016Z 2025-05-07T20:32:33.1281755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1282671Z 2025-05-07T20:32:33.1282832Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1283602Z self=, 2025-05-07T20:32:33.1284279Z T=1, 2025-05-07T20:32:33.1284576Z D=7168, 2025-05-07T20:32:33.1284899Z scale_ub=None, 2025-05-07T20:32:33.1285260Z contiguous=False, 2025-05-07T20:32:33.1285615Z compiled=False, 2025-05-07T20:32:33.1285936Z ) 2025-05-07T20:32:33.1286409Z self = 2025-05-07T20:32:33.1287187Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:33.1287668Z 2025-05-07T20:32:33.1287792Z @given( 2025-05-07T20:32:33.1288172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.1288713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.1289233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.1289815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.1290378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.1290893Z ) 2025-05-07T20:32:33.1291513Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.1292486Z def test_silu_mul_quant( 2025-05-07T20:32:33.1292897Z self, 2025-05-07T20:32:33.1293308Z T: int, 2025-05-07T20:32:33.1293628Z D: int, 2025-05-07T20:32:33.1293989Z scale_ub: Optional[float], 2025-05-07T20:32:33.1294451Z contiguous: bool, 2025-05-07T20:32:33.1294840Z compiled: bool, 2025-05-07T20:32:33.1295332Z ) -> None: 2025-05-07T20:32:33.1295685Z torch.manual_seed(2025) 2025-05-07T20:32:33.1296085Z 2025-05-07T20:32:33.1296547Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.1297149Z 2025-05-07T20:32:33.1297464Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1298010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1298540Z x = x_sign * x_clamp 2025-05-07T20:32:33.1298902Z x0 = x[:, :D] 2025-05-07T20:32:33.1299228Z x1 = x[:, D:] 2025-05-07T20:32:33.1299553Z 2025-05-07T20:32:33.1299838Z if contiguous: 2025-05-07T20:32:33.1300189Z x0 = x0.contiguous() 2025-05-07T20:32:33.1300605Z x1 = x1.contiguous() 2025-05-07T20:32:33.1300993Z 2025-05-07T20:32:33.1301284Z if scale_ub is not None: 2025-05-07T20:32:33.1301742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1302319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1302845Z ) 2025-05-07T20:32:33.1303160Z else: 2025-05-07T20:32:33.1303506Z scale_ub_tensor = None 2025-05-07T20:32:33.1303924Z 2025-05-07T20:32:33.1304308Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1304856Z op = silu_mul_quant 2025-05-07T20:32:33.1305282Z if compiled: 2025-05-07T20:32:33.1305685Z op = torch.compile(op) 2025-05-07T20:32:33.1306520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1307011Z 2025-05-07T20:32:33.1307320Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.1307596Z 2025-05-07T20:32:33.1307756Z moe/activation_test.py:117: 2025-05-07T20:32:33.1308262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1308837Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.1309322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1310608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.1311913Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.1312889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.1314165Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.1315412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.1316525Z kernel = self.compile( 2025-05-07T20:32:33.1317525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.1318753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.1319459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1319875Z 2025-05-07T20:32:33.1320227Z self = 2025-05-07T20:32:33.1321843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.1323820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917a57100>} 2025-05-07T20:32:33.1325930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.1327459Z context = 2025-05-07T20:32:33.1327878Z 2025-05-07T20:32:33.1328132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.1329056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.1329749Z module_map=module_map) 2025-05-07T20:32:33.1330281Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.1330785Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.1331330Z E ^ 2025-05-07T20:32:33.1332139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.1332908Z 2025-05-07T20:32:33.1333615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.1334506Z 2025-05-07T20:32:33.1334671Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1335308Z self=, 2025-05-07T20:32:33.1335985Z T=2048, 2025-05-07T20:32:33.1336299Z D=7168, 2025-05-07T20:32:33.1336610Z scale_ub=None, 2025-05-07T20:32:33.1336957Z contiguous=False, 2025-05-07T20:32:33.1337330Z compiled=True, 2025-05-07T20:32:33.1337665Z ) 2025-05-07T20:32:33.2202663Z self = 2025-05-07T20:32:33.2203637Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.2204161Z 2025-05-07T20:32:33.2204299Z @given( 2025-05-07T20:32:33.2204672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.2205222Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.2205736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.2206636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.2207230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.2207720Z ) 2025-05-07T20:32:33.2208325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.2209130Z def test_silu_mul_quant( 2025-05-07T20:32:33.2209543Z self, 2025-05-07T20:32:33.2209856Z T: int, 2025-05-07T20:32:33.2210181Z D: int, 2025-05-07T20:32:33.2210537Z scale_ub: Optional[float], 2025-05-07T20:32:33.2211000Z contiguous: bool, 2025-05-07T20:32:33.2211388Z compiled: bool, 2025-05-07T20:32:33.2211772Z ) -> None: 2025-05-07T20:32:33.2212214Z torch.manual_seed(2025) 2025-05-07T20:32:33.2212619Z 2025-05-07T20:32:33.2213075Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.2213971Z 2025-05-07T20:32:33.2214280Z x_sign = torch.sign(x) 2025-05-07T20:32:33.2214736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.2215231Z x = x_sign * x_clamp 2025-05-07T20:32:33.2215604Z x0 = x[:, :D] 2025-05-07T20:32:33.2215950Z x1 = x[:, D:] 2025-05-07T20:32:33.2216276Z 2025-05-07T20:32:33.2216559Z if contiguous: 2025-05-07T20:32:33.2216930Z x0 = x0.contiguous() 2025-05-07T20:32:33.2217358Z x1 = x1.contiguous() 2025-05-07T20:32:33.2217763Z 2025-05-07T20:32:33.2218080Z if scale_ub is not None: 2025-05-07T20:32:33.2218558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.2219142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.2219663Z ) 2025-05-07T20:32:33.2219972Z else: 2025-05-07T20:32:33.2220313Z scale_ub_tensor = None 2025-05-07T20:32:33.2220732Z 2025-05-07T20:32:33.2221112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.2221635Z op = silu_mul_quant 2025-05-07T20:32:33.2222055Z if compiled: 2025-05-07T20:32:33.2222597Z op = torch.compile(op) 2025-05-07T20:32:33.2223085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2223544Z 2025-05-07T20:32:33.2223856Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.2224243Z 2025-05-07T20:32:33.2224413Z moe/activation_test.py:117: 2025-05-07T20:32:33.2224906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2225483Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.2225960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2226983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.2228124Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.2229351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.2230644Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2231616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2232938Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2234177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2235161Z kernel = self.compile( 2025-05-07T20:32:33.2236125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2237081Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2237636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2237960Z 2025-05-07T20:32:33.2238254Z self = 2025-05-07T20:32:33.2239769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2241824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b44720>} 2025-05-07T20:32:33.2243880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2245452Z context = 2025-05-07T20:32:33.2245881Z 2025-05-07T20:32:33.2246118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2247002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2247709Z module_map=module_map) 2025-05-07T20:32:33.2248274Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2248824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2249234Z E ^ 2025-05-07T20:32:33.2249997Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.2250767Z 2025-05-07T20:32:33.2251467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.2252484Z 2025-05-07T20:32:33.2252642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.2253313Z self=, 2025-05-07T20:32:33.2253963Z T=4096, 2025-05-07T20:32:33.2254244Z D=7168, 2025-05-07T20:32:33.2254536Z scale_ub=None, 2025-05-07T20:32:33.2254871Z contiguous=False, 2025-05-07T20:32:33.2255208Z compiled=True, 2025-05-07T20:32:33.2255525Z ) 2025-05-07T20:32:33.2256679Z self = 2025-05-07T20:32:33.2257491Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.2257945Z 2025-05-07T20:32:33.2258059Z @given( 2025-05-07T20:32:33.2258463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.2258956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.2259429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.2259957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.2260480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.2260974Z ) 2025-05-07T20:32:33.2261539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.2262267Z def test_silu_mul_quant( 2025-05-07T20:32:33.2262637Z self, 2025-05-07T20:32:33.2262932Z T: int, 2025-05-07T20:32:33.2263235Z D: int, 2025-05-07T20:32:33.2263560Z scale_ub: Optional[float], 2025-05-07T20:32:33.2263983Z contiguous: bool, 2025-05-07T20:32:33.2264355Z compiled: bool, 2025-05-07T20:32:33.2264693Z ) -> None: 2025-05-07T20:32:33.2265025Z torch.manual_seed(2025) 2025-05-07T20:32:33.2265406Z 2025-05-07T20:32:33.2265822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.2266362Z 2025-05-07T20:32:33.2266654Z x_sign = torch.sign(x) 2025-05-07T20:32:33.2267104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.2267584Z x = x_sign * x_clamp 2025-05-07T20:32:33.2267962Z x0 = x[:, :D] 2025-05-07T20:32:33.2268291Z x1 = x[:, D:] 2025-05-07T20:32:33.2268599Z 2025-05-07T20:32:33.2268876Z if contiguous: 2025-05-07T20:32:33.2269228Z x0 = x0.contiguous() 2025-05-07T20:32:33.2269626Z x1 = x1.contiguous() 2025-05-07T20:32:33.2270001Z 2025-05-07T20:32:33.2270294Z if scale_ub is not None: 2025-05-07T20:32:33.2270728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.2281824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.2282320Z ) 2025-05-07T20:32:33.2282613Z else: 2025-05-07T20:32:33.2282933Z scale_ub_tensor = None 2025-05-07T20:32:33.2283331Z 2025-05-07T20:32:33.2283693Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.2284185Z op = silu_mul_quant 2025-05-07T20:32:33.2284574Z if compiled: 2025-05-07T20:32:33.2284953Z op = torch.compile(op) 2025-05-07T20:32:33.2285415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2285847Z 2025-05-07T20:32:33.2286137Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.2286396Z 2025-05-07T20:32:33.2286547Z moe/activation_test.py:117: 2025-05-07T20:32:33.2287092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2287629Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.2288070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2288985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.2289928Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.2291036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.2292270Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2293153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2294301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2295417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2296295Z kernel = self.compile( 2025-05-07T20:32:33.2297247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2298341Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2298970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2299400Z 2025-05-07T20:32:33.2299725Z self = 2025-05-07T20:32:33.2301553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2303974Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b45440>} 2025-05-07T20:32:33.2306546Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2308281Z context = 2025-05-07T20:32:33.2308768Z 2025-05-07T20:32:33.2309028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2309885Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2310612Z module_map=module_map) 2025-05-07T20:32:33.2311144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2311672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2312052Z E ^ 2025-05-07T20:32:33.2312752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.2313472Z 2025-05-07T20:32:33.2314134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.2315002Z 2025-05-07T20:32:33.3867545Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3868364Z self=, 2025-05-07T20:32:33.3869115Z T=16384, 2025-05-07T20:32:33.3869429Z D=5120, 2025-05-07T20:32:33.3869750Z scale_ub=1200.0, 2025-05-07T20:32:33.3870116Z contiguous=False, 2025-05-07T20:32:33.3870472Z compiled=False, 2025-05-07T20:32:33.3870810Z ) 2025-05-07T20:32:33.3871359Z self = 2025-05-07T20:32:33.3872269Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.3872780Z 2025-05-07T20:32:33.3872904Z @given( 2025-05-07T20:32:33.3873607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.3874164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.3874683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.3875266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.3875839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.3876326Z ) 2025-05-07T20:32:33.3876942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.3877750Z def test_silu_mul_quant( 2025-05-07T20:32:33.3878142Z self, 2025-05-07T20:32:33.3878462Z T: int, 2025-05-07T20:32:33.3878769Z D: int, 2025-05-07T20:32:33.3879100Z scale_ub: Optional[float], 2025-05-07T20:32:33.3879517Z contiguous: bool, 2025-05-07T20:32:33.3879898Z compiled: bool, 2025-05-07T20:32:33.3880258Z ) -> None: 2025-05-07T20:32:33.3880585Z torch.manual_seed(2025) 2025-05-07T20:32:33.3880968Z 2025-05-07T20:32:33.3881406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.3882010Z 2025-05-07T20:32:33.3882360Z x_sign = torch.sign(x) 2025-05-07T20:32:33.3882978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.3883506Z x = x_sign * x_clamp 2025-05-07T20:32:33.3883903Z x0 = x[:, :D] 2025-05-07T20:32:33.3884255Z x1 = x[:, D:] 2025-05-07T20:32:33.3884699Z 2025-05-07T20:32:33.3884995Z if contiguous: 2025-05-07T20:32:33.3885376Z x0 = x0.contiguous() 2025-05-07T20:32:33.3885801Z x1 = x1.contiguous() 2025-05-07T20:32:33.3886199Z 2025-05-07T20:32:33.3886515Z if scale_ub is not None: 2025-05-07T20:32:33.3886971Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.3887655Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.3888190Z ) 2025-05-07T20:32:33.3888505Z else: 2025-05-07T20:32:33.3888843Z scale_ub_tensor = None 2025-05-07T20:32:33.3889272Z 2025-05-07T20:32:33.3889654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.3890195Z op = silu_mul_quant 2025-05-07T20:32:33.3890624Z if compiled: 2025-05-07T20:32:33.3891046Z op = torch.compile(op) 2025-05-07T20:32:33.3891567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3892156Z 2025-05-07T20:32:33.3892481Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3892763Z 2025-05-07T20:32:33.3892924Z moe/activation_test.py:117: 2025-05-07T20:32:33.3893431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3894011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3894490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3895777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:33.3897081Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3898069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3899343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3900566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3901338Z kernel = self.compile( 2025-05-07T20:32:33.3902085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3903008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3903563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3903889Z 2025-05-07T20:32:33.3904177Z self = 2025-05-07T20:32:33.3905848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3908351Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b46340>} 2025-05-07T20:32:33.3910410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3911935Z context = 2025-05-07T20:32:33.3912421Z 2025-05-07T20:32:33.3912698Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3913497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3914271Z module_map=module_map) 2025-05-07T20:32:33.3914863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3915573Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3915992Z E ^ 2025-05-07T20:32:33.3916758Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3917653Z 2025-05-07T20:32:33.3918439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3919407Z 2025-05-07T20:32:33.3919578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3920308Z self=, 2025-05-07T20:32:33.3921138Z T=16384, 2025-05-07T20:32:33.3921448Z D=5120, 2025-05-07T20:32:33.3921761Z scale_ub=1200.0, 2025-05-07T20:32:33.3922131Z contiguous=True, 2025-05-07T20:32:33.3922497Z compiled=True, 2025-05-07T20:32:33.3922826Z ) 2025-05-07T20:32:33.3923373Z self = 2025-05-07T20:32:33.3924270Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:33.3924768Z 2025-05-07T20:32:33.3924893Z @given( 2025-05-07T20:32:33.3925275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.3925818Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.3926339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.3926909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.3927478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.3927971Z ) 2025-05-07T20:32:33.3928574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.3929374Z def test_silu_mul_quant( 2025-05-07T20:32:33.3929780Z self, 2025-05-07T20:32:33.3930091Z T: int, 2025-05-07T20:32:33.3930410Z D: int, 2025-05-07T20:32:33.3930771Z scale_ub: Optional[float], 2025-05-07T20:32:33.3931227Z contiguous: bool, 2025-05-07T20:32:33.3931627Z compiled: bool, 2025-05-07T20:32:33.3932128Z ) -> None: 2025-05-07T20:32:33.3932474Z torch.manual_seed(2025) 2025-05-07T20:32:33.3932883Z 2025-05-07T20:32:33.3933337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.3933930Z 2025-05-07T20:32:33.3934243Z x_sign = torch.sign(x) 2025-05-07T20:32:33.3934733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.3935265Z x = x_sign * x_clamp 2025-05-07T20:32:33.3935660Z x0 = x[:, :D] 2025-05-07T20:32:33.3936014Z x1 = x[:, D:] 2025-05-07T20:32:33.3936364Z 2025-05-07T20:32:33.3936655Z if contiguous: 2025-05-07T20:32:33.3937038Z x0 = x0.contiguous() 2025-05-07T20:32:33.3937473Z x1 = x1.contiguous() 2025-05-07T20:32:33.3937899Z 2025-05-07T20:32:33.3938330Z if scale_ub is not None: 2025-05-07T20:32:33.3938800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.3939381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.3939898Z ) 2025-05-07T20:32:33.3940219Z else: 2025-05-07T20:32:33.3940561Z scale_ub_tensor = None 2025-05-07T20:32:33.3940984Z 2025-05-07T20:32:33.3941361Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.3941906Z op = silu_mul_quant 2025-05-07T20:32:33.3942314Z if compiled: 2025-05-07T20:32:33.3942723Z op = torch.compile(op) 2025-05-07T20:32:33.3943222Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3943686Z 2025-05-07T20:32:33.3943999Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3944278Z 2025-05-07T20:32:33.3944447Z moe/activation_test.py:117: 2025-05-07T20:32:33.3944939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3945516Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3945999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3947084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.3948108Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.3949315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.3950695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3951678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3952933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3954215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3955200Z kernel = self.compile( 2025-05-07T20:32:33.3956182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3957393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3958086Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3958505Z 2025-05-07T20:32:33.3958858Z self = 2025-05-07T20:32:33.3960884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3963528Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b479c0>} 2025-05-07T20:32:33.3966091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3968031Z context = 2025-05-07T20:32:33.3968563Z 2025-05-07T20:32:33.3968846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3969787Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3970620Z module_map=module_map) 2025-05-07T20:32:33.3971249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3971945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3972394Z E ^ 2025-05-07T20:32:33.3973221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3974082Z 2025-05-07T20:32:33.3974927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3975902Z 2025-05-07T20:32:33.5672070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5672859Z self=, 2025-05-07T20:32:33.5673546Z T=16384, 2025-05-07T20:32:33.5673842Z D=5120, 2025-05-07T20:32:33.5674139Z scale_ub=None, 2025-05-07T20:32:33.5674481Z contiguous=False, 2025-05-07T20:32:33.5674822Z compiled=True, 2025-05-07T20:32:33.5675142Z ) 2025-05-07T20:32:33.5675656Z self = 2025-05-07T20:32:33.5676474Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.5676956Z 2025-05-07T20:32:33.5677066Z @given( 2025-05-07T20:32:33.5677404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.5677894Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.5678402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.5678942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.5679775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.5680240Z ) 2025-05-07T20:32:33.5680821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.5681579Z def test_silu_mul_quant( 2025-05-07T20:32:33.5682086Z self, 2025-05-07T20:32:33.5682385Z T: int, 2025-05-07T20:32:33.5682690Z D: int, 2025-05-07T20:32:33.5683017Z scale_ub: Optional[float], 2025-05-07T20:32:33.5683457Z contiguous: bool, 2025-05-07T20:32:33.5683849Z compiled: bool, 2025-05-07T20:32:33.5684215Z ) -> None: 2025-05-07T20:32:33.5684684Z torch.manual_seed(2025) 2025-05-07T20:32:33.5685072Z 2025-05-07T20:32:33.5685494Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.5686057Z 2025-05-07T20:32:33.5686363Z x_sign = torch.sign(x) 2025-05-07T20:32:33.5686813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.5687321Z x = x_sign * x_clamp 2025-05-07T20:32:33.5687703Z x0 = x[:, :D] 2025-05-07T20:32:33.5688027Z x1 = x[:, D:] 2025-05-07T20:32:33.5688352Z 2025-05-07T20:32:33.5688632Z if contiguous: 2025-05-07T20:32:33.5688988Z x0 = x0.contiguous() 2025-05-07T20:32:33.5689388Z x1 = x1.contiguous() 2025-05-07T20:32:33.5689767Z 2025-05-07T20:32:33.5690065Z if scale_ub is not None: 2025-05-07T20:32:33.5690494Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.5691062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.5691581Z ) 2025-05-07T20:32:33.5691974Z else: 2025-05-07T20:32:33.5692305Z scale_ub_tensor = None 2025-05-07T20:32:33.5692670Z 2025-05-07T20:32:33.5692972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.5693412Z op = silu_mul_quant 2025-05-07T20:32:33.5693754Z if compiled: 2025-05-07T20:32:33.5694076Z op = torch.compile(op) 2025-05-07T20:32:33.5694485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5694863Z 2025-05-07T20:32:33.5695117Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.5695375Z 2025-05-07T20:32:33.5695514Z moe/activation_test.py:117: 2025-05-07T20:32:33.5695943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5696439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.5696854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5697768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.5698635Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.5699838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.5700948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.5701855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.5702991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.5704193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.5705165Z kernel = self.compile( 2025-05-07T20:32:33.5706446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.5707646Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.5708337Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5708761Z 2025-05-07T20:32:33.5709104Z self = 2025-05-07T20:32:33.5711046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.5713522Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7cc20>} 2025-05-07T20:32:33.5715944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.5717730Z context = 2025-05-07T20:32:33.5718365Z 2025-05-07T20:32:33.5718633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.5719503Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.5720332Z module_map=module_map) 2025-05-07T20:32:33.5720904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.5721439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.5721807Z E ^ 2025-05-07T20:32:33.5722595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.5723455Z 2025-05-07T20:32:33.5724233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.5725213Z 2025-05-07T20:32:33.5725369Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5726088Z self=, 2025-05-07T20:32:33.5726803Z T=2048, 2025-05-07T20:32:33.5727110Z D=5120, 2025-05-07T20:32:33.5727420Z scale_ub=None, 2025-05-07T20:32:33.5727771Z contiguous=False, 2025-05-07T20:32:33.5728148Z compiled=True, 2025-05-07T20:32:33.5728482Z ) 2025-05-07T20:32:33.6629444Z self = 2025-05-07T20:32:33.6630462Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.6630960Z 2025-05-07T20:32:33.6631088Z @given( 2025-05-07T20:32:33.6631463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.6632021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.6632528Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.6633098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.6633671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.6634166Z ) 2025-05-07T20:32:33.6634770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.6635557Z def test_silu_mul_quant( 2025-05-07T20:32:33.6635964Z self, 2025-05-07T20:32:33.6636598Z T: int, 2025-05-07T20:32:33.6636929Z D: int, 2025-05-07T20:32:33.6637294Z scale_ub: Optional[float], 2025-05-07T20:32:33.6637743Z contiguous: bool, 2025-05-07T20:32:33.6638153Z compiled: bool, 2025-05-07T20:32:33.6638527Z ) -> None: 2025-05-07T20:32:33.6638873Z torch.manual_seed(2025) 2025-05-07T20:32:33.6639284Z 2025-05-07T20:32:33.6639742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.6640330Z 2025-05-07T20:32:33.6640633Z x_sign = torch.sign(x) 2025-05-07T20:32:33.6641095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.6641585Z x = x_sign * x_clamp 2025-05-07T20:32:33.6641956Z x0 = x[:, :D] 2025-05-07T20:32:33.6642298Z x1 = x[:, D:] 2025-05-07T20:32:33.6642626Z 2025-05-07T20:32:33.6642908Z if contiguous: 2025-05-07T20:32:33.6643271Z x0 = x0.contiguous() 2025-05-07T20:32:33.6643704Z x1 = x1.contiguous() 2025-05-07T20:32:33.6644108Z 2025-05-07T20:32:33.6644420Z if scale_ub is not None: 2025-05-07T20:32:33.6644876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.6645571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.6646106Z ) 2025-05-07T20:32:33.6646417Z else: 2025-05-07T20:32:33.6646865Z scale_ub_tensor = None 2025-05-07T20:32:33.6647408Z 2025-05-07T20:32:33.6647783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.6648313Z op = silu_mul_quant 2025-05-07T20:32:33.6648733Z if compiled: 2025-05-07T20:32:33.6649142Z op = torch.compile(op) 2025-05-07T20:32:33.6649632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6650233Z 2025-05-07T20:32:33.6650546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.6650827Z 2025-05-07T20:32:33.6651000Z moe/activation_test.py:117: 2025-05-07T20:32:33.6651504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6652195Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.6652693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6653713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.6654752Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.6655980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.6657269Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.6658244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.6659518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.6660756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.6661738Z kernel = self.compile( 2025-05-07T20:32:33.6662692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.6663646Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.6664189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6664515Z 2025-05-07T20:32:33.6664788Z self = 2025-05-07T20:32:33.6666309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.6668380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7d9e0>} 2025-05-07T20:32:33.6670541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.6672105Z context = 2025-05-07T20:32:33.6672521Z 2025-05-07T20:32:33.6672762Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.6673537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.6674264Z module_map=module_map) 2025-05-07T20:32:33.6685329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.6685935Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.6686341Z E ^ 2025-05-07T20:32:33.6687108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.6687881Z 2025-05-07T20:32:33.6688528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.6689322Z 2025-05-07T20:32:33.6689590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.6690264Z self=, 2025-05-07T20:32:33.6690922Z T=2048, 2025-05-07T20:32:33.6691280Z D=5120, 2025-05-07T20:32:33.6691575Z scale_ub=1200.0, 2025-05-07T20:32:33.6692076Z contiguous=False, 2025-05-07T20:32:33.6692414Z compiled=True, 2025-05-07T20:32:33.6692719Z ) 2025-05-07T20:32:33.6693232Z self = 2025-05-07T20:32:33.6694104Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.6694701Z 2025-05-07T20:32:33.6694825Z @given( 2025-05-07T20:32:33.6695200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.6695751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.6696267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.6696852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.6697412Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.6697899Z ) 2025-05-07T20:32:33.6698510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.6699307Z def test_silu_mul_quant( 2025-05-07T20:32:33.6699705Z self, 2025-05-07T20:32:33.6700025Z T: int, 2025-05-07T20:32:33.6700352Z D: int, 2025-05-07T20:32:33.6700704Z scale_ub: Optional[float], 2025-05-07T20:32:33.6701163Z contiguous: bool, 2025-05-07T20:32:33.6701563Z compiled: bool, 2025-05-07T20:32:33.6701942Z ) -> None: 2025-05-07T20:32:33.6702286Z torch.manual_seed(2025) 2025-05-07T20:32:33.6702693Z 2025-05-07T20:32:33.6703146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.6703744Z 2025-05-07T20:32:33.6704061Z x_sign = torch.sign(x) 2025-05-07T20:32:33.6704551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.6705082Z x = x_sign * x_clamp 2025-05-07T20:32:33.6705490Z x0 = x[:, :D] 2025-05-07T20:32:33.6705846Z x1 = x[:, D:] 2025-05-07T20:32:33.6706535Z 2025-05-07T20:32:33.6706859Z if contiguous: 2025-05-07T20:32:33.6707259Z x0 = x0.contiguous() 2025-05-07T20:32:33.6707687Z x1 = x1.contiguous() 2025-05-07T20:32:33.6708095Z 2025-05-07T20:32:33.6708405Z if scale_ub is not None: 2025-05-07T20:32:33.6708856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.6709433Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.6709961Z ) 2025-05-07T20:32:33.6710280Z else: 2025-05-07T20:32:33.6710621Z scale_ub_tensor = None 2025-05-07T20:32:33.6711045Z 2025-05-07T20:32:33.6711558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.6712099Z op = silu_mul_quant 2025-05-07T20:32:33.6712522Z if compiled: 2025-05-07T20:32:33.6712940Z op = torch.compile(op) 2025-05-07T20:32:33.6713435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6713910Z 2025-05-07T20:32:33.6714223Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.6714506Z 2025-05-07T20:32:33.6714668Z moe/activation_test.py:117: 2025-05-07T20:32:33.6715168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6715748Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.6716216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.6717241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.6718295Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.6719531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.6720816Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.6721892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.6723210Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.6724537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.6725510Z kernel = self.compile( 2025-05-07T20:32:33.6726493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.6727807Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.6728499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6728924Z 2025-05-07T20:32:33.6729283Z self = 2025-05-07T20:32:33.6731332Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.6734079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7eb60>} 2025-05-07T20:32:33.6736657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.6738587Z context = 2025-05-07T20:32:33.6739117Z 2025-05-07T20:32:33.6739399Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.6740335Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.6741185Z module_map=module_map) 2025-05-07T20:32:33.6741808Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.6742412Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.6742857Z E ^ 2025-05-07T20:32:33.6743679Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.6744544Z 2025-05-07T20:32:33.6745317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.6746298Z 2025-05-07T20:32:33.8482455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.8483279Z self=, 2025-05-07T20:32:33.8483706Z T=4096, 2025-05-07T20:32:33.8483894Z D=5120, 2025-05-07T20:32:33.8484363Z scale_ub=1200.0, 2025-05-07T20:32:33.8484598Z contiguous=True, 2025-05-07T20:32:33.8484815Z compiled=True, 2025-05-07T20:32:33.8485028Z ) 2025-05-07T20:32:33.8485363Z self = 2025-05-07T20:32:33.8485871Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:33.8486162Z 2025-05-07T20:32:33.8486247Z @given( 2025-05-07T20:32:33.8486483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.8486796Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.8487115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.8487453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.8487797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.8488084Z ) 2025-05-07T20:32:33.8488444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.8488896Z def test_silu_mul_quant( 2025-05-07T20:32:33.8489142Z self, 2025-05-07T20:32:33.8489347Z T: int, 2025-05-07T20:32:33.8489548Z D: int, 2025-05-07T20:32:33.8489861Z scale_ub: Optional[float], 2025-05-07T20:32:33.8490147Z contiguous: bool, 2025-05-07T20:32:33.8490392Z compiled: bool, 2025-05-07T20:32:33.8490622Z ) -> None: 2025-05-07T20:32:33.8490845Z torch.manual_seed(2025) 2025-05-07T20:32:33.8491172Z 2025-05-07T20:32:33.8491445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.8491795Z 2025-05-07T20:32:33.8492080Z x_sign = torch.sign(x) 2025-05-07T20:32:33.8492420Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.8492741Z x = x_sign * x_clamp 2025-05-07T20:32:33.8493075Z x0 = x[:, :D] 2025-05-07T20:32:33.8493299Z x1 = x[:, D:] 2025-05-07T20:32:33.8493508Z 2025-05-07T20:32:33.8493697Z if contiguous: 2025-05-07T20:32:33.8493936Z x0 = x0.contiguous() 2025-05-07T20:32:33.8494194Z x1 = x1.contiguous() 2025-05-07T20:32:33.8494437Z 2025-05-07T20:32:33.8494628Z if scale_ub is not None: 2025-05-07T20:32:33.8494903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.8495244Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.8495563Z ) 2025-05-07T20:32:33.8495753Z else: 2025-05-07T20:32:33.8495971Z scale_ub_tensor = None 2025-05-07T20:32:33.8496229Z 2025-05-07T20:32:33.8496459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.8496782Z op = silu_mul_quant 2025-05-07T20:32:33.8497038Z if compiled: 2025-05-07T20:32:33.8497284Z op = torch.compile(op) 2025-05-07T20:32:33.8497588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8497869Z 2025-05-07T20:32:33.8498070Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.8498236Z 2025-05-07T20:32:33.8498339Z moe/activation_test.py:117: 2025-05-07T20:32:33.8498642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8498984Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.8499268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8499843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.8500424Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.8501097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.8501806Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.8502358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.8503060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.8503789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.8504339Z kernel = self.compile( 2025-05-07T20:32:33.8504901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.8505576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.8505976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8506767Z 2025-05-07T20:32:33.8507011Z self = 2025-05-07T20:32:33.8508140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.8509596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917880180>} 2025-05-07T20:32:33.8511114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.8512183Z context = 2025-05-07T20:32:33.8512549Z 2025-05-07T20:32:33.8512719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.8513259Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.8513731Z module_map=module_map) 2025-05-07T20:32:33.8514104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.8514536Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.8514801Z E ^ 2025-05-07T20:32:33.8515283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.8515756Z 2025-05-07T20:32:33.8516201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.8516729Z 2025-05-07T20:32:33.8516842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.8517268Z self=, 2025-05-07T20:32:33.8517684Z T=128, 2025-05-07T20:32:33.8517879Z D=5120, 2025-05-07T20:32:33.8518075Z scale_ub=1200.0, 2025-05-07T20:32:33.8518299Z contiguous=False, 2025-05-07T20:32:33.8518528Z compiled=True, 2025-05-07T20:32:33.8518734Z ) 2025-05-07T20:32:34.1263978Z self = 2025-05-07T20:32:34.1264595Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.1264876Z 2025-05-07T20:32:34.1264961Z @given( 2025-05-07T20:32:34.1265205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1265525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1265836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1266170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1266501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1266787Z ) 2025-05-07T20:32:34.1267140Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1267609Z def test_silu_mul_quant( 2025-05-07T20:32:34.1267860Z self, 2025-05-07T20:32:34.1268055Z T: int, 2025-05-07T20:32:34.1268260Z D: int, 2025-05-07T20:32:34.1268495Z scale_ub: Optional[float], 2025-05-07T20:32:34.1268777Z contiguous: bool, 2025-05-07T20:32:34.1269027Z compiled: bool, 2025-05-07T20:32:34.1269270Z ) -> None: 2025-05-07T20:32:34.1269486Z torch.manual_seed(2025) 2025-05-07T20:32:34.1269741Z 2025-05-07T20:32:34.1270356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1270716Z 2025-05-07T20:32:34.1270911Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1271219Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1271538Z x = x_sign * x_clamp 2025-05-07T20:32:34.1271790Z x0 = x[:, :D] 2025-05-07T20:32:34.1272024Z x1 = x[:, D:] 2025-05-07T20:32:34.1272245Z 2025-05-07T20:32:34.1272432Z if contiguous: 2025-05-07T20:32:34.1272674Z x0 = x0.contiguous() 2025-05-07T20:32:34.1272937Z x1 = x1.contiguous() 2025-05-07T20:32:34.1273178Z 2025-05-07T20:32:34.1273374Z if scale_ub is not None: 2025-05-07T20:32:34.1273656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.1274001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.1274322Z ) 2025-05-07T20:32:34.1274522Z else: 2025-05-07T20:32:34.1274735Z scale_ub_tensor = None 2025-05-07T20:32:34.1274995Z 2025-05-07T20:32:34.1275232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.1275560Z op = silu_mul_quant 2025-05-07T20:32:34.1275902Z if compiled: 2025-05-07T20:32:34.1276160Z op = torch.compile(op) 2025-05-07T20:32:34.1276463Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1276818Z 2025-05-07T20:32:34.1277015Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.1277184Z 2025-05-07T20:32:34.1277293Z moe/activation_test.py:117: 2025-05-07T20:32:34.1277595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1277942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.1278313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1278892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.1279481Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.1280175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:34.1280895Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:34.1281440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:34.1282158Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:34.1282890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:34.1283447Z     kernel = self.compile(
2025-05-07T20:32:34.1284005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:34.1284692Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:34.1285108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.1285345Z 
2025-05-07T20:32:34.1285556Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:34.1286678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:34.1288133Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f8917880ea0>}
2025-05-07T20:32:34.1289525Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:34.1290599Z context = <...>
2025-05-07T20:32:34.1290892Z 
2025-05-07T20:32:34.1291113Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.1291652Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.1292208Z                            module_map=module_map)
2025-05-07T20:32:34.1292586Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.1292944Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.1293217Z E       ^
2025-05-07T20:32:34.1293694Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.1294158Z 
2025-05-07T20:32:34.1294586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.1295125Z 
2025-05-07T20:32:34.1295230Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:34.1295659Z     self=<...>,
2025-05-07T20:32:34.1296073Z     T=16384,
2025-05-07T20:32:34.1296265Z     D=7168,
2025-05-07T20:32:34.1296463Z     scale_ub=1200.0,
2025-05-07T20:32:34.1296690Z     contiguous=True,
2025-05-07T20:32:34.1296906Z     compiled=True,
2025-05-07T20:32:34.1297171Z )
2025-05-07T20:32:34.1297500Z self = <...>
2025-05-07T20:32:34.1298008Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:34.1298340Z 
2025-05-07T20:32:34.1298419Z     @given(
2025-05-07T20:32:34.1298657Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:34.1298971Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:34.1299282Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:34.1299619Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:34.1300004Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:34.1300288Z     )
2025-05-07T20:32:34.1300648Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:34.1301107Z     def test_silu_mul_quant(
2025-05-07T20:32:34.1301349Z         self,
2025-05-07T20:32:34.1301554Z         T: int,
2025-05-07T20:32:34.1301758Z         D: int,
2025-05-07T20:32:34.1301974Z         scale_ub: Optional[float],
2025-05-07T20:32:34.1302252Z         contiguous: bool,
2025-05-07T20:32:34.1302497Z         compiled: bool,
2025-05-07T20:32:34.1302722Z     ) -> None:
2025-05-07T20:32:34.1302951Z         torch.manual_seed(2025)
2025-05-07T20:32:34.1303227Z 
2025-05-07T20:32:34.1303516Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:34.1303873Z 
2025-05-07T20:32:34.1304075Z         x_sign = torch.sign(x)
2025-05-07T20:32:34.1304377Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.1304700Z         x = x_sign * x_clamp
2025-05-07T20:32:34.1304949Z         x0 = x[:, :D]
2025-05-07T20:32:34.1305177Z         x1 = x[:, D:]
2025-05-07T20:32:34.1305388Z 
2025-05-07T20:32:34.1305585Z         if contiguous:
2025-05-07T20:32:34.1305823Z             x0 = x0.contiguous()
2025-05-07T20:32:34.1306082Z             x1 = x1.contiguous()
2025-05-07T20:32:34.1306591Z 
2025-05-07T20:32:34.1306791Z         if scale_ub is not None:
2025-05-07T20:32:34.1307064Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:34.1307407Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:34.1307737Z             )
2025-05-07T20:32:34.1307937Z         else:
2025-05-07T20:32:34.1308162Z             scale_ub_tensor = None
2025-05-07T20:32:34.1308426Z 
2025-05-07T20:32:34.1308658Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:34.1308985Z             op = silu_mul_quant
2025-05-07T20:32:34.1309247Z             if compiled:
2025-05-07T20:32:34.1309499Z                 op = torch.compile(op)
2025-05-07T20:32:34.1309805Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:34.1310089Z 
2025-05-07T20:32:34.1310290Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:34.1310539Z 
2025-05-07T20:32:34.1310643Z moe/activation_test.py:117: 
2025-05-07T20:32:34.1310948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.1311291Z moe/activation_test.py:115: in fn
2025-05-07T20:32:34.1311575Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:34.1312153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:34.1312738Z     return fn(*args, **kwargs)
2025-05-07T20:32:34.1313419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:34.1314127Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:34.1314685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:34.1315397Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:34.1316079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:34.1316637Z     kernel = self.compile(
2025-05-07T20:32:34.1317310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:34.1317994Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:34.1318506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.1318757Z 
2025-05-07T20:32:34.1318973Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:34.1320098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:34.1321584Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89178820c0>}
2025-05-07T20:32:34.1323027Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:34.1324095Z context = <...>
2025-05-07T20:32:34.1324399Z 
2025-05-07T20:32:34.1324570Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.1325115Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.1325596Z                            module_map=module_map)
2025-05-07T20:32:34.1325979Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.1326347Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.1326618Z E       ^
2025-05-07T20:32:34.1327097Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.1327570Z 
2025-05-07T20:32:34.1328003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.1328532Z 
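Every Hypothesis example in this job dies at the same spot: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype used by _fbgemm_silu_mul_quant. fp8e4nv needs a GPU with compute capability 8.9 or newer (Ada/Hopper); the ~22 GiB card on this runner is an A10G-class sm_86 part, which only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate that would skip rather than fail these cases on older GPUs; the helper name and the skip placement are illustrative assumptions, not the actual structure of moe/activation_test.py:

    import unittest

    import torch


    def fp8e4nv_supported() -> bool:
        # Illustrative helper: Triton lowers fp8e4nv (E4M3) only on GPUs
        # with compute capability 8.9 or newer (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical placement: gate the fp8 test class on the check above.
    @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv requires sm_89+")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With a gate like this the run would report skips on sm_86 runners instead of burning every Hypothesis example on the same compile failure, as happens below.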
2025-05-07T20:32:34.2551239Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:34.2596344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.2596990Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.2628370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.4338764Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.4371030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.4371663Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.5312573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.5313220Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:34.5353776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.5981850Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:34.5982303Z     self=<...>,
2025-05-07T20:32:34.5982738Z     T=16384,
2025-05-07T20:32:34.5982934Z     D=5120,
2025-05-07T20:32:34.5983128Z     scale_ub=None,
2025-05-07T20:32:34.5983341Z     contiguous=False,
2025-05-07T20:32:34.5983560Z     compiled=False,
2025-05-07T20:32:34.5983999Z )
2025-05-07T20:32:34.5984332Z self = <...>
2025-05-07T20:32:34.5984855Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:34.5985148Z 
2025-05-07T20:32:34.5985228Z     @given(
2025-05-07T20:32:34.5985467Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:34.5985791Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:34.5986098Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:34.5986437Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:34.5986775Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:34.5987061Z     )
2025-05-07T20:32:34.5987426Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:34.5987880Z     def test_silu_mul_quant(
2025-05-07T20:32:34.5988121Z         self,
2025-05-07T20:32:34.5988326Z         T: int,
2025-05-07T20:32:34.5988532Z         D: int,
2025-05-07T20:32:34.5988761Z         scale_ub: Optional[float],
2025-05-07T20:32:34.5989033Z         contiguous: bool,
2025-05-07T20:32:34.5989282Z         compiled: bool,
2025-05-07T20:32:34.5989603Z     ) -> None:
2025-05-07T20:32:34.5989825Z         torch.manual_seed(2025)
2025-05-07T20:32:34.5990078Z 
2025-05-07T20:32:34.5990358Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:34.5990771Z 
2025-05-07T20:32:34.5990970Z         x_sign = torch.sign(x)
2025-05-07T20:32:34.5991266Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.5993374Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.5995420Z 
2025-05-07T20:32:34.5995551Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:34.5995770Z 
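The failed allocation sizes are exactly what the test's shapes predict: every intermediate here (torch.abs(x), x_clamp, and so on) is a [T, 2*D] bfloat16 tensor at 2 bytes per element, so the T=16384, D=5120 example above needs 16384 * 10240 * 2 B = 320 MiB for one more copy, precisely the amount the allocator could not find. A back-of-the-envelope check of the other figures in this log, assuming nothing beyond those shapes:

    # Each intermediate in test_silu_mul_quant is a [T, 2*D] bf16 tensor,
    # i.e. T * (2*D) * 2 bytes. Compare with the "Tried to allocate" sizes.
    MIB = 1 << 20


    def intermediate_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / MIB


    assert intermediate_mib(16384, 5120) == 320.0  # x_clamp failure above
    assert intermediate_mib(4096, 7168) == 112.0   # x_clamp failure below
    assert intermediate_mib(16384, 7168) == 448.0  # torch.randn failure below
    assert intermediate_mib(2048, 7168) == 56.0    # the 56 MiB failures below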
2025-05-07T20:32:34.5995874Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:34.6005099Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.6007498Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.6009600Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:34.6009932Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.6018345Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:34.6020501Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.6022637Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:34.6022966Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:34.6032039Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:34.6034118Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.6036223Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:34.6036542Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:34.7167993Z >       x_sign = torch.sign(x)
2025-05-07T20:32:34.7170038Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:34.7172200Z moe/activation_test.py:94: OutOfMemoryError
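By this point the process already holds roughly 22 GiB of the card's 22.07 GiB, so even 56 MiB requests fail: tensors kept alive across earlier Hypothesis examples plus allocator fragmentation leave almost nothing free. The error text itself suggests expandable segments, and releasing the cache between examples would also help. A sketch of both mitigations, assuming the job's environment can be edited and a per-example cleanup hook can be added (the hook name is illustrative):

    import os

    # Mitigation 1: must be set before the first CUDA allocation, so in
    # practice it belongs in the CI job's environment, not in test code.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc

    import torch


    def free_cuda_between_examples() -> None:
        # Mitigation 2: drop tensors kept alive by the previous example and
        # return cached segments to the driver before the next one runs.
        gc.collect()
        torch.cuda.empty_cache()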
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7172080Z 2025-05-07T20:32:34.7172200Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.7172422Z 2025-05-07T20:32:34.7172523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7172948Z self=, 2025-05-07T20:32:34.7173356Z T=1, 2025-05-07T20:32:34.7173542Z D=7168, 2025-05-07T20:32:34.7173813Z scale_ub=1200.0, 2025-05-07T20:32:34.7174055Z contiguous=True, 2025-05-07T20:32:34.7174272Z compiled=False, 2025-05-07T20:32:34.7174483Z ) 2025-05-07T20:32:34.7174809Z self = 2025-05-07T20:32:34.7175413Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.7175687Z 2025-05-07T20:32:34.7175765Z @given( 2025-05-07T20:32:34.7175996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7176313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7176696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7177031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7177367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7177650Z ) 2025-05-07T20:32:34.7178008Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7178464Z def test_silu_mul_quant( 2025-05-07T20:32:34.7178708Z self, 2025-05-07T20:32:34.7178901Z T: int, 2025-05-07T20:32:34.7179100Z D: int, 2025-05-07T20:32:34.7179321Z scale_ub: Optional[float], 2025-05-07T20:32:34.7179593Z contiguous: bool, 2025-05-07T20:32:34.7179837Z compiled: bool, 2025-05-07T20:32:34.7180055Z ) -> None: 2025-05-07T20:32:34.7180266Z torch.manual_seed(2025) 2025-05-07T20:32:34.7180510Z 2025-05-07T20:32:34.7180780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7181119Z 2025-05-07T20:32:34.7181313Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7181606Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7181915Z x = x_sign * x_clamp 2025-05-07T20:32:34.7182163Z x0 = x[:, :D] 2025-05-07T20:32:34.7182380Z x1 = x[:, D:] 2025-05-07T20:32:34.7182582Z 2025-05-07T20:32:34.7182766Z if contiguous: 2025-05-07T20:32:34.7183003Z x0 = x0.contiguous() 2025-05-07T20:32:34.7183260Z x1 = x1.contiguous() 2025-05-07T20:32:34.7183507Z 2025-05-07T20:32:34.7183701Z if scale_ub is not None: 2025-05-07T20:32:34.7183974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7184308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7184624Z ) 2025-05-07T20:32:34.7184823Z else: 2025-05-07T20:32:34.7185036Z scale_ub_tensor = None 2025-05-07T20:32:34.7185293Z 2025-05-07T20:32:34.7185534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7185859Z op = silu_mul_quant 2025-05-07T20:32:34.7186117Z if compiled: 2025-05-07T20:32:34.7186375Z op = torch.compile(op) 2025-05-07T20:32:34.7186726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7187015Z 2025-05-07T20:32:34.7187213Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7187382Z 2025-05-07T20:32:34.7187486Z moe/activation_test.py:117: 2025-05-07T20:32:34.7187792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7188136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7188429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7189146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7189866Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7190426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7191136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7191831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7192387Z kernel = self.compile( 2025-05-07T20:32:34.7192995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7193673Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7194085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7194364Z 2025-05-07T20:32:34.7194582Z self = 2025-05-07T20:32:34.7195705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7197176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917552520>} 2025-05-07T20:32:34.7198582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7199650Z context = 2025-05-07T20:32:34.7199949Z 2025-05-07T20:32:34.7200126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7200660Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7201143Z module_map=module_map) 2025-05-07T20:32:34.7201520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7201889Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7202148Z E ^ 2025-05-07T20:32:34.7202630Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7203096Z 2025-05-07T20:32:34.7203543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7204078Z 2025-05-07T20:32:34.7204188Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7204611Z self=, 2025-05-07T20:32:34.7205031Z T=128, 2025-05-07T20:32:34.7205225Z D=5120, 2025-05-07T20:32:34.7205415Z scale_ub=None, 2025-05-07T20:32:34.7205633Z contiguous=True, 2025-05-07T20:32:34.7205867Z compiled=False, 2025-05-07T20:32:34.7206072Z ) 2025-05-07T20:32:34.7879555Z self = 2025-05-07T20:32:34.7880679Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.7881238Z 2025-05-07T20:32:34.7881395Z @given( 2025-05-07T20:32:34.7882171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7882797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7883218Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7883559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7883891Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7884173Z ) 2025-05-07T20:32:34.7884531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7884995Z def test_silu_mul_quant( 2025-05-07T20:32:34.7885239Z self, 2025-05-07T20:32:34.7885440Z T: int, 2025-05-07T20:32:34.7885644Z D: int, 2025-05-07T20:32:34.7885860Z scale_ub: Optional[float], 2025-05-07T20:32:34.7886140Z contiguous: bool, 2025-05-07T20:32:34.7886386Z compiled: bool, 2025-05-07T20:32:34.7886611Z ) -> None: 2025-05-07T20:32:34.7886831Z torch.manual_seed(2025) 2025-05-07T20:32:34.7887077Z 2025-05-07T20:32:34.7887356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7887704Z 2025-05-07T20:32:34.7887900Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7888281Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7888597Z x = x_sign * x_clamp 2025-05-07T20:32:34.7888846Z x0 = x[:, :D] 2025-05-07T20:32:34.7889069Z x1 = x[:, D:] 2025-05-07T20:32:34.7889337Z 2025-05-07T20:32:34.7889526Z if contiguous: 2025-05-07T20:32:34.7889761Z x0 = x0.contiguous() 2025-05-07T20:32:34.7890016Z x1 = x1.contiguous() 2025-05-07T20:32:34.7890258Z 2025-05-07T20:32:34.7890450Z if scale_ub is not None: 2025-05-07T20:32:34.7890718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7891131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7891444Z ) 2025-05-07T20:32:34.7891633Z else: 2025-05-07T20:32:34.7891929Z scale_ub_tensor = None 2025-05-07T20:32:34.7892189Z 2025-05-07T20:32:34.7892418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7892740Z op = silu_mul_quant 2025-05-07T20:32:34.7892996Z if compiled: 2025-05-07T20:32:34.7893244Z op = torch.compile(op) 2025-05-07T20:32:34.7893543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7893824Z 2025-05-07T20:32:34.7894018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7894191Z 2025-05-07T20:32:34.7894290Z moe/activation_test.py:117: 2025-05-07T20:32:34.7894587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7894923Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7895203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7895918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7896634Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7897181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7897892Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7898580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7899131Z kernel = self.compile( 2025-05-07T20:32:34.7899685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7900364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7900773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7901012Z 2025-05-07T20:32:34.7901227Z self = 2025-05-07T20:32:34.7902395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7903831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917553420>} 2025-05-07T20:32:34.7905224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7906607Z context = 2025-05-07T20:32:34.7906915Z 2025-05-07T20:32:34.7907098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7907647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7908135Z module_map=module_map) 2025-05-07T20:32:34.7908514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7908956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7909234Z E ^ 2025-05-07T20:32:34.7909718Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7910244Z 2025-05-07T20:32:34.7910683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7911217Z 2025-05-07T20:32:34.7911326Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7911754Z self=, 2025-05-07T20:32:34.7912237Z T=128, 2025-05-07T20:32:34.7912429Z D=7168, 2025-05-07T20:32:34.7912628Z scale_ub=None, 2025-05-07T20:32:34.7912849Z contiguous=True, 2025-05-07T20:32:34.7913074Z compiled=False, 2025-05-07T20:32:34.7913287Z ) 2025-05-07T20:32:34.7913619Z self = 2025-05-07T20:32:34.7914126Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.7914411Z 2025-05-07T20:32:34.7914490Z @given( 2025-05-07T20:32:34.7914726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7915050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7915363Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7915701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7916038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7916327Z ) 2025-05-07T20:32:34.7916688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7917150Z def test_silu_mul_quant( 2025-05-07T20:32:34.7917397Z self, 2025-05-07T20:32:34.7917604Z T: int, 2025-05-07T20:32:34.7917811Z D: int, 2025-05-07T20:32:34.7918034Z scale_ub: Optional[float], 2025-05-07T20:32:34.7918313Z contiguous: bool, 2025-05-07T20:32:34.7918563Z compiled: bool, 2025-05-07T20:32:34.7918798Z ) -> None: 2025-05-07T20:32:34.7919018Z torch.manual_seed(2025) 2025-05-07T20:32:34.7919274Z 2025-05-07T20:32:34.7919558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7919910Z 2025-05-07T20:32:34.7920112Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7920413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7920732Z x = x_sign * x_clamp 2025-05-07T20:32:34.7920994Z x0 = x[:, :D] 2025-05-07T20:32:34.7921225Z x1 = x[:, D:] 2025-05-07T20:32:34.7921440Z 2025-05-07T20:32:34.7921637Z if contiguous: 2025-05-07T20:32:34.7921879Z x0 = x0.contiguous() 2025-05-07T20:32:34.7930952Z x1 = x1.contiguous() 2025-05-07T20:32:34.7931222Z 2025-05-07T20:32:34.7931547Z if scale_ub is not None: 2025-05-07T20:32:34.7931906Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7932260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7932577Z ) 2025-05-07T20:32:34.7932772Z else: 2025-05-07T20:32:34.7932983Z scale_ub_tensor = None 2025-05-07T20:32:34.7933234Z 2025-05-07T20:32:34.7933473Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7933802Z op = silu_mul_quant 2025-05-07T20:32:34.7934058Z if compiled: 2025-05-07T20:32:34.7934312Z op = torch.compile(op) 2025-05-07T20:32:34.7934615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7934890Z 2025-05-07T20:32:34.7935091Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7935264Z 2025-05-07T20:32:34.7935368Z moe/activation_test.py:117: 2025-05-07T20:32:34.7935675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7936018Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7936310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7937176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7937894Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7938451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7939977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7940674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7941272Z kernel = self.compile( 2025-05-07T20:32:34.7941843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7942535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7942948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7943193Z 2025-05-07T20:32:34.7943409Z self = 2025-05-07T20:32:34.7944543Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7945996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172984a0>} 2025-05-07T20:32:34.7947397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7948465Z context = 2025-05-07T20:32:34.7948771Z 2025-05-07T20:32:34.7948945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7949497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7949985Z module_map=module_map) 2025-05-07T20:32:34.7950364Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7950738Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7951010Z E ^ 2025-05-07T20:32:34.7951493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7951969Z 2025-05-07T20:32:34.7952407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7952979Z 2025-05-07T20:32:34.7953105Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7953587Z self=, 2025-05-07T20:32:34.7954004Z T=2048, 2025-05-07T20:32:34.7954198Z D=7168, 2025-05-07T20:32:34.7954400Z scale_ub=1200.0, 2025-05-07T20:32:34.7954623Z contiguous=True, 2025-05-07T20:32:34.7954855Z compiled=False, 2025-05-07T20:32:34.7955067Z ) 2025-05-07T20:32:34.8749263Z self = 2025-05-07T20:32:34.8750009Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.8750380Z 2025-05-07T20:32:34.8750480Z @given( 2025-05-07T20:32:34.8750775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8751098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8751413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8751751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8752086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8752384Z ) 2025-05-07T20:32:34.8752739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8753465Z def test_silu_mul_quant( 2025-05-07T20:32:34.8753724Z self, 2025-05-07T20:32:34.8753923Z T: int, 2025-05-07T20:32:34.8754130Z D: int, 2025-05-07T20:32:34.8754358Z scale_ub: Optional[float], 2025-05-07T20:32:34.8754748Z contiguous: bool, 2025-05-07T20:32:34.8754996Z compiled: bool, 2025-05-07T20:32:34.8755230Z ) -> None: 2025-05-07T20:32:34.8755457Z torch.manual_seed(2025) 2025-05-07T20:32:34.8755716Z 2025-05-07T20:32:34.8755996Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8758252Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
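Two failure modes alternate from here to the end of the job. The CompilationError above is an architecture mismatch: Triton's fp8e4nv type is the hardware FP8 E4M3 format, which requires compute capability 8.9 or newer (Ada/Hopper), while the GPU on this runner only offers the fp8e4b15 and fp8e5 software variants listed in the error, consistent with a pre-Ada part such as an A10G (SM 8.6). A minimal guard that would skip these cases on unsupported hardware, sketched with a hypothetical helper name; the capability query itself is standard PyTorch:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton only lowers fp8e4nv (hardware E4M3)
        # on devices with compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...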
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.8760335Z 2025-05-07T20:32:34.8760462Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.8760694Z 2025-05-07T20:32:34.8760799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8761233Z self=, 2025-05-07T20:32:34.8761654Z T=1, 2025-05-07T20:32:34.8761842Z D=5120, 2025-05-07T20:32:34.8762047Z scale_ub=1200.0, 2025-05-07T20:32:34.8762286Z contiguous=True, 2025-05-07T20:32:34.8762511Z compiled=False, 2025-05-07T20:32:34.8762730Z ) 2025-05-07T20:32:34.8763063Z self = 2025-05-07T20:32:34.8763569Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.8763852Z 2025-05-07T20:32:34.8763934Z @given( 2025-05-07T20:32:34.8764172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8764498Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8764813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8765153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8765492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8765780Z ) 2025-05-07T20:32:34.8766143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8766603Z def test_silu_mul_quant( 2025-05-07T20:32:34.8766850Z self, 2025-05-07T20:32:34.8767052Z T: int, 2025-05-07T20:32:34.8767258Z D: int, 2025-05-07T20:32:34.8767479Z scale_ub: Optional[float], 2025-05-07T20:32:34.8767759Z contiguous: bool, 2025-05-07T20:32:34.8768098Z compiled: bool, 2025-05-07T20:32:34.8768327Z ) -> None: 2025-05-07T20:32:34.8768551Z torch.manual_seed(2025) 2025-05-07T20:32:34.8768800Z 2025-05-07T20:32:34.8769072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8769423Z 2025-05-07T20:32:34.8769617Z x_sign = torch.sign(x) 2025-05-07T20:32:34.8769915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.8770230Z x = x_sign * x_clamp 2025-05-07T20:32:34.8770474Z x0 = x[:, :D] 2025-05-07T20:32:34.8770693Z x1 = x[:, D:] 2025-05-07T20:32:34.8770898Z 2025-05-07T20:32:34.8771086Z if contiguous: 2025-05-07T20:32:34.8771323Z x0 = x0.contiguous() 2025-05-07T20:32:34.8771582Z x1 = x1.contiguous() 2025-05-07T20:32:34.8771904Z 2025-05-07T20:32:34.8772101Z if scale_ub is not None: 2025-05-07T20:32:34.8772372Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.8772721Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.8773057Z ) 2025-05-07T20:32:34.8773273Z else: 2025-05-07T20:32:34.8773541Z scale_ub_tensor = None 2025-05-07T20:32:34.8773799Z 2025-05-07T20:32:34.8774028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.8774351Z op = silu_mul_quant 2025-05-07T20:32:34.8774647Z if compiled: 2025-05-07T20:32:34.8774897Z op = torch.compile(op) 2025-05-07T20:32:34.8775196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8775479Z 2025-05-07T20:32:34.8775677Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.8775845Z 2025-05-07T20:32:34.8775946Z moe/activation_test.py:117: 2025-05-07T20:32:34.8776299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8776642Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.8776929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8777651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.8778367Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.8778918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.8779614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.8780299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.8780847Z kernel = self.compile( 2025-05-07T20:32:34.8781398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.8782077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.8782487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8782722Z 2025-05-07T20:32:34.8782938Z self = 2025-05-07T20:32:34.8784056Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.8785481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917299a80>} 2025-05-07T20:32:34.8786875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.8787935Z context = 2025-05-07T20:32:34.8788230Z 2025-05-07T20:32:34.8788456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.8788991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.8789476Z module_map=module_map) 2025-05-07T20:32:34.8789849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.8790209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.8790467Z E ^ 2025-05-07T20:32:34.8790945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.8791409Z 2025-05-07T20:32:34.8791848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.8792381Z 2025-05-07T20:32:34.8792493Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8792912Z self=, 2025-05-07T20:32:34.8793327Z T=2048, 2025-05-07T20:32:34.8793518Z D=5120, 2025-05-07T20:32:34.8793708Z scale_ub=None, 2025-05-07T20:32:34.8793927Z contiguous=True, 2025-05-07T20:32:34.8794153Z compiled=False, 2025-05-07T20:32:34.8794400Z ) 2025-05-07T20:32:34.8794732Z self = 2025-05-07T20:32:34.8795245Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.8795566Z 2025-05-07T20:32:34.8795643Z @given( 2025-05-07T20:32:34.8795879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8796198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8796505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8796843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8797220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8797517Z ) 2025-05-07T20:32:34.8797872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8798323Z def test_silu_mul_quant( 2025-05-07T20:32:34.8798573Z self, 2025-05-07T20:32:34.8798768Z T: int, 2025-05-07T20:32:34.8798977Z D: int, 2025-05-07T20:32:34.8799198Z scale_ub: Optional[float], 2025-05-07T20:32:34.8799471Z contiguous: bool, 2025-05-07T20:32:34.8799718Z compiled: bool, 2025-05-07T20:32:34.8799966Z ) -> None: 2025-05-07T20:32:34.8800190Z torch.manual_seed(2025) 2025-05-07T20:32:34.8800437Z 2025-05-07T20:32:34.8800709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8801059Z 2025-05-07T20:32:34.8801254Z > x_sign = torch.sign(x) 2025-05-07T20:32:34.8803345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
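The OutOfMemoryError records carry their own remediation hint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That setting is read when the CUDA caching allocator initializes, so it only takes effect if it is in the environment before the first CUDA allocation; a sketch, assuming a pytest conftest.py at the test root (exporting the variable in the CI job step would work equally well):

    # conftest.py -- must execute before anything touches the GPU.
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after the allocator config on purpose)

Fragmentation is only part of the story here, though: the "22.04 GiB memory in use" figure shows the pool is nearly exhausted before these requests of tens to hundreds of MiB are made, which points at state surviving across generated examples rather than at the allocator (see the note further down).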
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.8805295Z 2025-05-07T20:32:34.8805422Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.8805641Z 2025-05-07T20:32:34.8805746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8806544Z self=, 2025-05-07T20:32:34.8806967Z T=16384, 2025-05-07T20:32:34.8807156Z D=5120, 2025-05-07T20:32:34.8807352Z scale_ub=None, 2025-05-07T20:32:34.8807570Z contiguous=True, 2025-05-07T20:32:34.8807795Z compiled=False, 2025-05-07T20:32:34.8808008Z ) 2025-05-07T20:32:34.9557972Z self = 2025-05-07T20:32:34.9558854Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.9559152Z 2025-05-07T20:32:34.9559235Z @given( 2025-05-07T20:32:34.9559475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9559785Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9560093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9560427Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9560756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9561045Z ) 2025-05-07T20:32:34.9561397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9561847Z def test_silu_mul_quant( 2025-05-07T20:32:34.9562087Z self, 2025-05-07T20:32:34.9562286Z T: int, 2025-05-07T20:32:34.9562505Z D: int, 2025-05-07T20:32:34.9562746Z scale_ub: Optional[float], 2025-05-07T20:32:34.9563018Z contiguous: bool, 2025-05-07T20:32:34.9563264Z compiled: bool, 2025-05-07T20:32:34.9563491Z ) -> None: 2025-05-07T20:32:34.9563707Z torch.manual_seed(2025) 2025-05-07T20:32:34.9563949Z 2025-05-07T20:32:34.9564295Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9566434Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9568531Z 2025-05-07T20:32:34.9568653Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9568876Z 2025-05-07T20:32:34.9568981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9569405Z self=, 2025-05-07T20:32:34.9569816Z T=4096, 2025-05-07T20:32:34.9570010Z D=5120, 2025-05-07T20:32:34.9570206Z scale_ub=None, 2025-05-07T20:32:34.9570417Z contiguous=True, 2025-05-07T20:32:34.9570645Z compiled=False, 2025-05-07T20:32:34.9570857Z ) 2025-05-07T20:32:34.9571180Z self = 2025-05-07T20:32:34.9571689Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.9572034Z 2025-05-07T20:32:34.9572122Z @given( 2025-05-07T20:32:34.9572357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9572675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9572988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9573323Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9573656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9573951Z ) 2025-05-07T20:32:34.9574307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9574757Z def test_silu_mul_quant( 2025-05-07T20:32:34.9575004Z self, 2025-05-07T20:32:34.9575202Z T: int, 2025-05-07T20:32:34.9575398Z D: int, 2025-05-07T20:32:34.9575624Z scale_ub: Optional[float], 2025-05-07T20:32:34.9575903Z contiguous: bool, 2025-05-07T20:32:34.9576143Z compiled: bool, 2025-05-07T20:32:34.9576369Z ) -> None: 2025-05-07T20:32:34.9576586Z torch.manual_seed(2025) 2025-05-07T20:32:34.9576830Z 2025-05-07T20:32:34.9577097Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9579275Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9581208Z 2025-05-07T20:32:34.9581330Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9581546Z 2025-05-07T20:32:34.9581655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9582071Z self=, 2025-05-07T20:32:34.9582482Z T=2048, 2025-05-07T20:32:34.9582670Z D=5120, 2025-05-07T20:32:34.9582864Z scale_ub=None, 2025-05-07T20:32:34.9583077Z contiguous=False, 2025-05-07T20:32:34.9583307Z compiled=False, 2025-05-07T20:32:34.9583512Z ) 2025-05-07T20:32:34.9583831Z self = 2025-05-07T20:32:34.9584347Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.9584627Z 2025-05-07T20:32:34.9584755Z @given( 2025-05-07T20:32:34.9584981Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9585296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9585606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9585977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9586313Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9586602Z ) 2025-05-07T20:32:34.9586959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9587449Z def test_silu_mul_quant( 2025-05-07T20:32:34.9587697Z self, 2025-05-07T20:32:34.9587894Z T: int, 2025-05-07T20:32:34.9588089Z D: int, 2025-05-07T20:32:34.9588313Z scale_ub: Optional[float], 2025-05-07T20:32:34.9588593Z contiguous: bool, 2025-05-07T20:32:34.9588830Z compiled: bool, 2025-05-07T20:32:34.9589057Z ) -> None: 2025-05-07T20:32:34.9589280Z torch.manual_seed(2025) 2025-05-07T20:32:34.9589520Z 2025-05-07T20:32:34.9589795Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9591919Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9593861Z 2025-05-07T20:32:34.9593980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9594197Z 2025-05-07T20:32:34.9594312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9594731Z self=, 2025-05-07T20:32:34.9595143Z T=4096, 2025-05-07T20:32:34.9595337Z D=7168, 2025-05-07T20:32:34.9595524Z scale_ub=None, 2025-05-07T20:32:34.9595743Z contiguous=True, 2025-05-07T20:32:34.9595974Z compiled=True, 2025-05-07T20:32:34.9596173Z ) 2025-05-07T20:32:34.9596495Z self = 2025-05-07T20:32:34.9597004Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.9597281Z 2025-05-07T20:32:34.9597367Z @given( 2025-05-07T20:32:34.9597596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9597914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9598224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9598601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9598940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9599233Z ) 2025-05-07T20:32:34.9599585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9600036Z def test_silu_mul_quant( 2025-05-07T20:32:34.9600279Z self, 2025-05-07T20:32:34.9600470Z T: int, 2025-05-07T20:32:34.9600672Z D: int, 2025-05-07T20:32:34.9600892Z scale_ub: Optional[float], 2025-05-07T20:32:34.9601160Z contiguous: bool, 2025-05-07T20:32:34.9601407Z compiled: bool, 2025-05-07T20:32:34.9601631Z ) -> None: 2025-05-07T20:32:34.9601853Z torch.manual_seed(2025) 2025-05-07T20:32:34.9602090Z 2025-05-07T20:32:34.9602370Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9604546Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9606799Z 2025-05-07T20:32:34.9606926Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9607142Z 2025-05-07T20:32:34.9607245Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9607666Z self=, 2025-05-07T20:32:34.9608156Z T=2048, 2025-05-07T20:32:34.9608356Z D=5120, 2025-05-07T20:32:34.9608554Z scale_ub=1200.0, 2025-05-07T20:32:34.9608793Z contiguous=False, 2025-05-07T20:32:34.9609033Z compiled=False, 2025-05-07T20:32:34.9609245Z ) 2025-05-07T20:32:34.9609604Z self = 2025-05-07T20:32:34.9610188Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.9610513Z 2025-05-07T20:32:34.9610593Z @given( 2025-05-07T20:32:34.9610837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9611187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9611524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9611964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9612303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9612599Z ) 2025-05-07T20:32:34.9612947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9613405Z def test_silu_mul_quant( 2025-05-07T20:32:34.9613656Z self, 2025-05-07T20:32:34.9613848Z T: int, 2025-05-07T20:32:34.9614049Z D: int, 2025-05-07T20:32:34.9614271Z scale_ub: Optional[float], 2025-05-07T20:32:34.9614540Z contiguous: bool, 2025-05-07T20:32:34.9614786Z compiled: bool, 2025-05-07T20:32:34.9615014Z ) -> None: 2025-05-07T20:32:34.9615225Z torch.manual_seed(2025) 2025-05-07T20:32:34.9615468Z 2025-05-07T20:32:34.9615742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9617864Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.9619872Z 2025-05-07T20:32:34.9619998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.9620214Z 2025-05-07T20:32:34.9620317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9620739Z self=, 2025-05-07T20:32:34.9621154Z T=4096, 2025-05-07T20:32:34.9621336Z D=7168, 2025-05-07T20:32:34.9621527Z scale_ub=1200.0, 2025-05-07T20:32:34.9621754Z contiguous=True, 2025-05-07T20:32:34.9621973Z compiled=False, 2025-05-07T20:32:34.9622179Z ) 2025-05-07T20:32:35.0687492Z self = 2025-05-07T20:32:35.0688276Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.0688674Z 2025-05-07T20:32:35.0688754Z @given( 2025-05-07T20:32:35.0689007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0689335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0689649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0689984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0690551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0690840Z ) 2025-05-07T20:32:35.0691203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0691663Z def test_silu_mul_quant( 2025-05-07T20:32:35.0701341Z self, 2025-05-07T20:32:35.0701592Z T: int, 2025-05-07T20:32:35.0701797Z D: int, 2025-05-07T20:32:35.0702026Z scale_ub: Optional[float], 2025-05-07T20:32:35.0702311Z contiguous: bool, 2025-05-07T20:32:35.0702557Z compiled: bool, 2025-05-07T20:32:35.0702795Z ) -> None: 2025-05-07T20:32:35.0703177Z torch.manual_seed(2025) 2025-05-07T20:32:35.0703422Z 2025-05-07T20:32:35.0703706Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0705865Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0708091Z 2025-05-07T20:32:35.0708225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0708447Z 2025-05-07T20:32:35.0708560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0708986Z self=, 2025-05-07T20:32:35.0709413Z T=16384, 2025-05-07T20:32:35.0709616Z D=7168, 2025-05-07T20:32:35.0709815Z scale_ub=None, 2025-05-07T20:32:35.0710044Z contiguous=False, 2025-05-07T20:32:35.0710283Z compiled=True, 2025-05-07T20:32:35.0710494Z ) 2025-05-07T20:32:35.0710823Z self = 2025-05-07T20:32:35.0711347Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.0711634Z 2025-05-07T20:32:35.0711715Z @given( 2025-05-07T20:32:35.0711953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0712280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0712599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0712934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0713271Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0713570Z ) 2025-05-07T20:32:35.0713926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0714385Z def test_silu_mul_quant( 2025-05-07T20:32:35.0714637Z self, 2025-05-07T20:32:35.0714922Z T: int, 2025-05-07T20:32:35.0715131Z D: int, 2025-05-07T20:32:35.0715369Z scale_ub: Optional[float], 2025-05-07T20:32:35.0715663Z contiguous: bool, 2025-05-07T20:32:35.0715929Z compiled: bool, 2025-05-07T20:32:35.0716169Z ) -> None: 2025-05-07T20:32:35.0716398Z torch.manual_seed(2025) 2025-05-07T20:32:35.0716668Z 2025-05-07T20:32:35.0716966Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0719569Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0721952Z 2025-05-07T20:32:35.0722088Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0722400Z 2025-05-07T20:32:35.0722509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0722938Z self=, 2025-05-07T20:32:35.0723358Z T=4096, 2025-05-07T20:32:35.0723612Z D=7168, 2025-05-07T20:32:35.0723815Z scale_ub=None, 2025-05-07T20:32:35.0724046Z contiguous=True, 2025-05-07T20:32:35.0724276Z compiled=False, 2025-05-07T20:32:35.0724486Z ) 2025-05-07T20:32:35.0724807Z self = 2025-05-07T20:32:35.0725318Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.0725668Z 2025-05-07T20:32:35.0725751Z @given( 2025-05-07T20:32:35.0725991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0726319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0726631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0726975Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0727319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0727609Z ) 2025-05-07T20:32:35.0727971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0728436Z def test_silu_mul_quant( 2025-05-07T20:32:35.0728692Z self, 2025-05-07T20:32:35.0728891Z T: int, 2025-05-07T20:32:35.0729095Z D: int, 2025-05-07T20:32:35.0729321Z scale_ub: Optional[float], 2025-05-07T20:32:35.0729595Z contiguous: bool, 2025-05-07T20:32:35.0729844Z compiled: bool, 2025-05-07T20:32:35.0730076Z ) -> None: 2025-05-07T20:32:35.0730294Z torch.manual_seed(2025) 2025-05-07T20:32:35.0730547Z 2025-05-07T20:32:35.0730824Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0733037Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0734979Z 2025-05-07T20:32:35.0735100Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0735324Z 2025-05-07T20:32:35.0735432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0735861Z self=, 2025-05-07T20:32:35.0736279Z T=16384, 2025-05-07T20:32:35.0736476Z D=7168, 2025-05-07T20:32:35.0736730Z scale_ub=None, 2025-05-07T20:32:35.0736955Z contiguous=True, 2025-05-07T20:32:35.0737180Z compiled=False, 2025-05-07T20:32:35.0737390Z ) 2025-05-07T20:32:35.0737718Z self = 2025-05-07T20:32:35.0738226Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.0738522Z 2025-05-07T20:32:35.0738605Z @given( 2025-05-07T20:32:35.0738847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0739161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0739479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0739818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0740156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0740446Z ) 2025-05-07T20:32:35.0740811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0741278Z def test_silu_mul_quant( 2025-05-07T20:32:35.0741527Z self, 2025-05-07T20:32:35.0741733Z T: int, 2025-05-07T20:32:35.0741940Z D: int, 2025-05-07T20:32:35.0742583Z scale_ub: Optional[float], 2025-05-07T20:32:35.0742874Z contiguous: bool, 2025-05-07T20:32:35.0743124Z compiled: bool, 2025-05-07T20:32:35.0743351Z ) -> None: 2025-05-07T20:32:35.0743577Z torch.manual_seed(2025) 2025-05-07T20:32:35.0743880Z 2025-05-07T20:32:35.0744155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0746283Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0748272Z 2025-05-07T20:32:35.0748399Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0748622Z 2025-05-07T20:32:35.0748728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0749157Z self=, 2025-05-07T20:32:35.0749572Z T=16384, 2025-05-07T20:32:35.0749775Z D=7168, 2025-05-07T20:32:35.0749974Z scale_ub=1200.0, 2025-05-07T20:32:35.0750198Z contiguous=True, 2025-05-07T20:32:35.0750429Z compiled=False, 2025-05-07T20:32:35.0750641Z ) 2025-05-07T20:32:35.0750966Z self = 2025-05-07T20:32:35.0751485Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.0751780Z 2025-05-07T20:32:35.0751863Z @given( 2025-05-07T20:32:35.0752100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.0752446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.0752793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.0753133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.0753466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.0753759Z ) 2025-05-07T20:32:35.0754128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.0754586Z def test_silu_mul_quant( 2025-05-07T20:32:35.0754834Z self, 2025-05-07T20:32:35.0755037Z T: int, 2025-05-07T20:32:35.0755243Z D: int, 2025-05-07T20:32:35.0755463Z scale_ub: Optional[float], 2025-05-07T20:32:35.0755743Z contiguous: bool, 2025-05-07T20:32:35.0755994Z compiled: bool, 2025-05-07T20:32:35.0756218Z ) -> None: 2025-05-07T20:32:35.0756443Z torch.manual_seed(2025) 2025-05-07T20:32:35.0756695Z 2025-05-07T20:32:35.0757019Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.0759158Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.0761094Z 2025-05-07T20:32:35.0761215Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.0761440Z 2025-05-07T20:32:35.0761545Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.0761975Z self=, 2025-05-07T20:32:35.0762390Z T=128, 2025-05-07T20:32:35.0762587Z D=5120, 2025-05-07T20:32:35.0762786Z scale_ub=1200.0, 2025-05-07T20:32:35.0763012Z contiguous=False, 2025-05-07T20:32:35.0763294Z compiled=False, 2025-05-07T20:32:35.0763512Z ) 2025-05-07T20:32:35.2037543Z self = 2025-05-07T20:32:35.2038961Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.2039956Z 2025-05-07T20:32:35.2040118Z @given( 2025-05-07T20:32:35.2040582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2041213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2041870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2042566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2042899Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2043196Z ) 2025-05-07T20:32:35.2043556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2044013Z def test_silu_mul_quant( 2025-05-07T20:32:35.2044268Z self, 2025-05-07T20:32:35.2044469Z T: int, 2025-05-07T20:32:35.2044670Z D: int, 2025-05-07T20:32:35.2044894Z scale_ub: Optional[float], 2025-05-07T20:32:35.2045170Z contiguous: bool, 2025-05-07T20:32:35.2045408Z compiled: bool, 2025-05-07T20:32:35.2045647Z ) -> None: 2025-05-07T20:32:35.2045869Z torch.manual_seed(2025) 2025-05-07T20:32:35.2046112Z 2025-05-07T20:32:35.2046394Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2046742Z 2025-05-07T20:32:35.2046936Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2047233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2047555Z x = x_sign * x_clamp 2025-05-07T20:32:35.2047795Z x0 = x[:, :D] 2025-05-07T20:32:35.2048022Z x1 = x[:, D:] 2025-05-07T20:32:35.2048237Z 2025-05-07T20:32:35.2048428Z if contiguous: 2025-05-07T20:32:35.2048670Z x0 = x0.contiguous() 2025-05-07T20:32:35.2048934Z x1 = x1.contiguous() 2025-05-07T20:32:35.2049174Z 2025-05-07T20:32:35.2049371Z if scale_ub is not None: 2025-05-07T20:32:35.2049648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2049985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2050316Z ) 2025-05-07T20:32:35.2050520Z else: 2025-05-07T20:32:35.2050737Z scale_ub_tensor = None 2025-05-07T20:32:35.2050989Z 2025-05-07T20:32:35.2051228Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2051549Z op = silu_mul_quant 2025-05-07T20:32:35.2051805Z if compiled: 2025-05-07T20:32:35.2052168Z op = torch.compile(op) 2025-05-07T20:32:35.2052478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2052751Z 2025-05-07T20:32:35.2053072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2053266Z 2025-05-07T20:32:35.2053378Z moe/activation_test.py:117: 2025-05-07T20:32:35.2053682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2054031Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2054322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2055046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2055760Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2056318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2057027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2057711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2058265Z kernel = self.compile( 2025-05-07T20:32:35.2058826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2059583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2059989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2060229Z 2025-05-07T20:32:35.2060483Z self = 2025-05-07T20:32:35.2061605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2063087Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e07c0>} 2025-05-07T20:32:35.2064473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2065533Z context = 2025-05-07T20:32:35.2065838Z 2025-05-07T20:32:35.2066011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2066555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2067030Z module_map=module_map) 2025-05-07T20:32:35.2067404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2067771Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2068039Z E ^ 2025-05-07T20:32:35.2068515Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2068985Z 2025-05-07T20:32:35.2069418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2069947Z 2025-05-07T20:32:35.2070063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2070488Z self=, 2025-05-07T20:32:35.2070898Z T=2048, 2025-05-07T20:32:35.2071093Z D=7168, 2025-05-07T20:32:35.2071292Z scale_ub=None, 2025-05-07T20:32:35.2071509Z contiguous=False, 2025-05-07T20:32:35.2071741Z compiled=False, 2025-05-07T20:32:35.2071953Z ) 2025-05-07T20:32:35.2072275Z self = 2025-05-07T20:32:35.2072792Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.2073077Z 2025-05-07T20:32:35.2073163Z @given( 2025-05-07T20:32:35.2073395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2073717Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2074081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2074423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2074757Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2075050Z ) 2025-05-07T20:32:35.2075410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2075861Z def test_silu_mul_quant( 2025-05-07T20:32:35.2076112Z self, 2025-05-07T20:32:35.2076312Z T: int, 2025-05-07T20:32:35.2076504Z D: int, 2025-05-07T20:32:35.2076731Z scale_ub: Optional[float], 2025-05-07T20:32:35.2077010Z contiguous: bool, 2025-05-07T20:32:35.2077252Z compiled: bool, 2025-05-07T20:32:35.2077489Z ) -> None: 2025-05-07T20:32:35.2077710Z torch.manual_seed(2025) 2025-05-07T20:32:35.2077949Z 2025-05-07T20:32:35.2078226Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2080421Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
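For context on what the test body repeated in these traces exercises: silu_mul_quant is a fused kernel, and judging by the test name and the reference path visible in the failure summary below (a ref_fn calling triton_quantize_fp8_row on the product), the unfused math is a SiLU-gated multiply followed by row-wise fp8 quantization. An eager-mode sketch of just the activation half, as an illustration only:

    import torch
    import torch.nn.functional as F

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Unfused reference: silu(x0) * x1. The fused fbgemm kernel computes
        # this product and quantizes it to fp8 with one scale per row.
        return F.silu(x0) * x1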
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2082399Z 2025-05-07T20:32:35.2082520Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.2082736Z 2025-05-07T20:32:35.2082846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2083317Z self=, 2025-05-07T20:32:35.2083731Z T=128, 2025-05-07T20:32:35.2083927Z D=7168, 2025-05-07T20:32:35.2084114Z scale_ub=1200.0, 2025-05-07T20:32:35.2084344Z contiguous=True, 2025-05-07T20:32:35.2084578Z compiled=True, 2025-05-07T20:32:35.2084779Z ) 2025-05-07T20:32:35.2393314Z self = 2025-05-07T20:32:35.2394782Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2395547Z 2025-05-07T20:32:35.2395749Z @given( 2025-05-07T20:32:35.2396362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2397205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2397993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2398759Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2399401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2399971Z ) 2025-05-07T20:32:35.2400665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2401538Z def test_silu_mul_quant( 2025-05-07T20:32:35.2402021Z self, 2025-05-07T20:32:35.2402395Z T: int, 2025-05-07T20:32:35.2402775Z D: int, 2025-05-07T20:32:35.2403059Z scale_ub: Optional[float], 2025-05-07T20:32:35.2403377Z contiguous: bool, 2025-05-07T20:32:35.2403619Z compiled: bool, 2025-05-07T20:32:35.2403841Z ) -> None: 2025-05-07T20:32:35.2404056Z torch.manual_seed(2025) 2025-05-07T20:32:35.2404304Z 2025-05-07T20:32:35.2404574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2404921Z 2025-05-07T20:32:35.2405116Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2405403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2405719Z x = x_sign * x_clamp 2025-05-07T20:32:35.2405962Z x0 = x[:, :D] 2025-05-07T20:32:35.2406463Z x1 = x[:, D:] 2025-05-07T20:32:35.2406676Z 2025-05-07T20:32:35.2406865Z if contiguous: 2025-05-07T20:32:35.2407091Z x0 = x0.contiguous() 2025-05-07T20:32:35.2407559Z x1 = x1.contiguous() 2025-05-07T20:32:35.2407804Z 2025-05-07T20:32:35.2407994Z if scale_ub is not None: 2025-05-07T20:32:35.2408273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2408611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2408926Z ) 2025-05-07T20:32:35.2409119Z else: 2025-05-07T20:32:35.2409337Z scale_ub_tensor = None 2025-05-07T20:32:35.2409590Z 2025-05-07T20:32:35.2409824Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2410146Z op = silu_mul_quant 2025-05-07T20:32:35.2410402Z if compiled: 2025-05-07T20:32:35.2410647Z op = torch.compile(op) 2025-05-07T20:32:35.2410955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2411239Z 2025-05-07T20:32:35.2411431Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2411604Z 2025-05-07T20:32:35.2411704Z moe/activation_test.py:117: 2025-05-07T20:32:35.2412092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2412430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2412789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2413368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.2413944Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.2414686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2415400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2415953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2416733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2417420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2417970Z kernel = self.compile( 2025-05-07T20:32:35.2418530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2419199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2419612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2419857Z 2025-05-07T20:32:35.2420067Z self = 2025-05-07T20:32:35.2421186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2422619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e1940>} 2025-05-07T20:32:35.2424010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2425073Z context = 2025-05-07T20:32:35.2425370Z 2025-05-07T20:32:35.2425550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2426091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2426568Z module_map=module_map) 2025-05-07T20:32:35.2426943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2427315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2427577Z E ^ 2025-05-07T20:32:35.2428150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2428617Z 2025-05-07T20:32:35.2429057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2429587Z 2025-05-07T20:32:35.2429698Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2430123Z self=, 2025-05-07T20:32:35.2430547Z T=128, 2025-05-07T20:32:35.2430746Z D=7168, 2025-05-07T20:32:35.2430942Z scale_ub=1200.0, 2025-05-07T20:32:35.2431181Z contiguous=True, 2025-05-07T20:32:35.2431414Z compiled=False, 2025-05-07T20:32:35.2431624Z ) 2025-05-07T20:32:35.2431953Z self = 2025-05-07T20:32:35.2432467Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.2432745Z 2025-05-07T20:32:35.2432833Z @given( 2025-05-07T20:32:35.2433063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2433385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2433702Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2434084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2434425Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2434719Z ) 2025-05-07T20:32:35.2435072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2435592Z def test_silu_mul_quant( 2025-05-07T20:32:35.2435847Z self, 2025-05-07T20:32:35.2436075Z T: int, 2025-05-07T20:32:35.2436273Z D: int, 2025-05-07T20:32:35.2436498Z scale_ub: Optional[float], 2025-05-07T20:32:35.2436775Z contiguous: bool, 2025-05-07T20:32:35.2437063Z compiled: bool, 2025-05-07T20:32:35.2437296Z ) -> None: 2025-05-07T20:32:35.2437516Z torch.manual_seed(2025) 2025-05-07T20:32:35.2437758Z 2025-05-07T20:32:35.2438038Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2438390Z 2025-05-07T20:32:35.2438584Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2438885Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2440983Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
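By this example only 4.44 MiB of the 22.07 GiB card remains free, so even the 20 MiB temporary for torch.clamp fails: allocations are accumulating across generated examples and starving the later ones. For unittest.TestCase tests, Hypothesis runs setUp and tearDown around every generated example, which gives a natural place to release the pool between examples; a hedged sketch (gc plus empty_cache is a blunt instrument, but it rules lingering caches and fragmentation out):

    import gc
    import unittest
    import torch

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Runs after each Hypothesis example for TestCase-style tests:
            # drop dead references, then hand cached CUDA blocks back so the
            # next example starts from an empty pool.
            gc.collect()
            torch.cuda.empty_cache()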
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2442965Z 2025-05-07T20:32:35.2443114Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.2443333Z 2025-05-07T20:32:35.2443447Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2443869Z self=, 2025-05-07T20:32:35.2444305Z T=128, 2025-05-07T20:32:35.2444506Z D=5120, 2025-05-07T20:32:35.2444703Z scale_ub=1200.0, 2025-05-07T20:32:35.2444934Z contiguous=True, 2025-05-07T20:32:35.2445166Z compiled=True, 2025-05-07T20:32:35.2445368Z ) 2025-05-07T20:32:35.2445700Z self = 2025-05-07T20:32:35.2446215Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2446493Z 2025-05-07T20:32:35.2446580Z @given( 2025-05-07T20:32:35.2446815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2456710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2457038Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2457371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2457806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2458098Z ) 2025-05-07T20:32:35.2458452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2458911Z def test_silu_mul_quant( 2025-05-07T20:32:35.2459161Z self, 2025-05-07T20:32:35.2459365Z T: int, 2025-05-07T20:32:35.2459562Z D: int, 2025-05-07T20:32:35.2459789Z scale_ub: Optional[float], 2025-05-07T20:32:35.2460074Z contiguous: bool, 2025-05-07T20:32:35.2460319Z compiled: bool, 2025-05-07T20:32:35.2460550Z ) -> None: 2025-05-07T20:32:35.2460776Z torch.manual_seed(2025) 2025-05-07T20:32:35.2461018Z 2025-05-07T20:32:35.2461301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2461659Z 2025-05-07T20:32:35.2461852Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2462149Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2464281Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2466728Z 2025-05-07T20:32:35.2466857Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.2467074Z 2025-05-07T20:32:35.2467184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2467600Z self=, 2025-05-07T20:32:35.2468065Z T=128, 2025-05-07T20:32:35.2468258Z D=7168, 2025-05-07T20:32:35.2468450Z scale_ub=None, 2025-05-07T20:32:35.2468670Z contiguous=True, 2025-05-07T20:32:35.2468900Z compiled=True, 2025-05-07T20:32:35.2469103Z ) 2025-05-07T20:32:35.4946987Z self = 2025-05-07T20:32:35.4947551Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.4947832Z 2025-05-07T20:32:35.4947913Z @given( 2025-05-07T20:32:35.4948153Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.4948475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.4948791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.4949134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.4949472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.4949769Z ) 2025-05-07T20:32:35.4950138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.4950596Z def test_silu_mul_quant( 2025-05-07T20:32:35.4950852Z self, 2025-05-07T20:32:35.4951064Z T: int, 2025-05-07T20:32:35.4951274Z D: int, 2025-05-07T20:32:35.4951498Z scale_ub: Optional[float], 2025-05-07T20:32:35.4951784Z contiguous: bool, 2025-05-07T20:32:35.4952040Z compiled: bool, 2025-05-07T20:32:35.4952274Z ) -> None: 2025-05-07T20:32:35.4952500Z torch.manual_seed(2025) 2025-05-07T20:32:35.4952782Z 2025-05-07T20:32:35.4953079Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.4955486Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.4957465Z 2025-05-07T20:32:35.4973051Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.4973371Z 2025-05-07T20:32:35.4973491Z FAILED 2025-05-07T20:32:35.4973648Z 2025-05-07T20:32:35.4973831Z =================================== FAILURES =================================== 2025-05-07T20:32:35.4974448Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:35.4975102Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:35.4975996Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:35.4976791Z | yield 2025-05-07T20:32:35.4977414Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:35.4978182Z | self._callTestMethod(testMethod) 2025-05-07T20:32:35.4979005Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:35.4979825Z | if method() is not None: 2025-05-07T20:32:35.4980183Z | ^^^^^^^^ 2025-05-07T20:32:35.4981453Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:35.4982568Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.4983137Z | ^^^^^^^ 2025-05-07T20:32:35.4983966Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:35.4984906Z | raise the_error_hypothesis_found 2025-05-07T20:32:35.4985527Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:35.4986259Z +-+---------------- 1 ---------------- 2025-05-07T20:32:35.4986676Z | Traceback (most recent call last): 2025-05-07T20:32:35.4987739Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.4988911Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.4989454Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.4992499Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.4995562Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.4996208Z | self=, 2025-05-07T20:32:35.4996806Z | T=2048, 2025-05-07T20:32:35.4997141Z | D=5120, # or any other generated value 2025-05-07T20:32:35.4997628Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.4998142Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.4998675Z | compiled=False, # or any other generated value 2025-05-07T20:32:35.4999111Z | ) 2025-05-07T20:32:35.4999363Z | 2025-05-07T20:32:35.5000135Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:35.5001050Z +---------------- 2 ---------------- 2025-05-07T20:32:35.5001470Z | Traceback (most recent call last): 2025-05-07T20:32:35.5002606Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.5003784Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5004329Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.5007614Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.5010157Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.5010609Z | self=, 2025-05-07T20:32:35.5011030Z | T=128, 2025-05-07T20:32:35.5011232Z | D=7168, 2025-05-07T20:32:35.5011436Z | scale_ub=None, 2025-05-07T20:32:35.5011779Z | contiguous=True, 2025-05-07T20:32:35.5012127Z | compiled=True, 2025-05-07T20:32:35.5012366Z | ) 2025-05-07T20:32:35.5012553Z | 2025-05-07T20:32:35.5013097Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.5013789Z +---------------- 3 ---------------- 2025-05-07T20:32:35.5014086Z | Traceback (most recent call last): 2025-05-07T20:32:35.5014818Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.5015684Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5016075Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.5018143Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
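Each "You can reproduce this example" line above pins an exact replay: the first argument is the Hypothesis version that produced the blob, the second encodes the choices for one falsifying example. Temporarily applied to this test it would look roughly like the following, with the blob copied from failure 1 and the decorators and body otherwise exactly as in activation_test.py (_MAX_SAMPLES is defined in that file):

    import unittest
    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    class ActivationTests(unittest.TestCase):
        @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # temporary; remove after debugging
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
        def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
            ...  # body unchanged from the test file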
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
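Hypothesis prints one version-pinned @reproduce_failure blob per distinct failure. A sketch of replaying sub-failure 3 locally with the blob above; the blob is only valid against hypothesis 6.131.14 and this exact strategy list, and the real test's @settings line (verbosity, _MAX_SAMPLES) is omitted here for self-containment.

from typing import Optional

from hypothesis import given, reproduce_failure, strategies as st

# Hedged sketch: temporarily pin the falsifying example reported above.
@reproduce_failure("6.131.14", b"AEEBQQBBAUEAQQA=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant_replay(
    T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
) -> None:
    # Stand-in body; in practice this would be the unchanged
    # test_silu_mul_quant body from the listing printed below.
    ...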
  +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
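Sub-failures 1-3 are out-of-memory; sub-failure 4 is the Triton error. Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which the GPU on this runner (a g5 / A10G, sm_86) does not support, hence only 'fp8e4b15' and 'fp8e5' compile. A sketch of a preflight check: fp8_e4m3_supported is a hypothetical helper, and the (8, 9) cutoff (sm_89, Ada) is an assumption, not something the log states.

import torch

def fp8_e4m3_supported() -> bool:
    # Hypothetical guard: assume NVIDIA e4m3 (Triton fp8e4nv) support begins
    # at sm_89; the A10G here reports sm_86, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A check like this, run before launching _kernel_quantize_fp8_row or _fbgemm_silu_mul_quant, would surface a clear skip or fallback instead of a CompilationError from deep inside make_ir.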
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aef86dc60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
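The two pieces this test exercises are simple enough to emulate off-GPU: y = silu(x0) * x1, then row-wise fp8 quantization with an optional scale upper bound. The helper below is an illustrative stand-in for triton_quantize_fp8_row's row-wise scheme (per-row scale against the e4m3 finite max of 448); it is not FBGEMM's actual kernel, and the exact handling of scale_ub there may differ.

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def rowwise_fp8_quant_ref(
    y: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Illustrative: one scale per row, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX                  # per-row dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

# CPU round trip mirroring ref_fn above (no Triton, no CUDA needed):
x0, x1 = torch.randn(4, 8), torch.randn(4, 8)
y = x0 * torch.sigmoid(x0) * x1                  # silu(x0) * x1
y_fp8, y_scale = rowwise_fp8_quant_ref(y)
y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]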
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
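Every remaining example fails the same way: the eager/compiled path dies compiling _fbgemm_silu_mul_quant and the reference path dies compiling _kernel_quantize_fp8_row, both on the fp8e4nv ValueError. One way a suite could avoid burning CI time on such runners is a class-level skip; a sketch using stdlib unittest, reusing the (8, 9) capability assumption from above:

import unittest

import torch

def _fp8_gpu_available() -> bool:
    # Same assumed sm_89 cutoff as the guard sketched earlier.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _fp8_gpu_available(), "GPU lacks fp8e4nv (float8_e4m3fn) support")
class ActivationTests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # existing Hypothesis-driven body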
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
    [test body as above]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
    [test body as above]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
    [test body as above]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
    [test body as above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5413381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5413614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5413975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5414071Z kernel = self.compile( 2025-05-07T20:32:35.5414473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5414655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5414837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5414841Z 2025-05-07T20:32:35.5415061Z self = 2025-05-07T20:32:35.5415871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5416400Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9f5f9c0>} 2025-05-07T20:32:35.5417181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5417387Z context = 2025-05-07T20:32:35.5417391Z 2025-05-07T20:32:35.5417572Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5417913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5418027Z module_map=module_map) 2025-05-07T20:32:35.5418192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5418331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5418415Z E ^ 2025-05-07T20:32:35.5418782Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5418787Z 2025-05-07T20:32:35.5419216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5419265Z 2025-05-07T20:32:35.5419384Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5419690Z self=, 2025-05-07T20:32:35.5419810Z T=1, 2025-05-07T20:32:35.5419888Z D=5120, 2025-05-07T20:32:35.5419973Z scale_ub=None, 2025-05-07T20:32:35.5420063Z contiguous=True, 2025-05-07T20:32:35.5420150Z compiled=True, 2025-05-07T20:32:35.5420224Z ) 2025-05-07T20:32:35.5420456Z self = 2025-05-07T20:32:35.5420628Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5420636Z 2025-05-07T20:32:35.5420718Z @given( 2025-05-07T20:32:35.5420838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5420937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5421056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5421176Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5421294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5421377Z ) 2025-05-07T20:32:35.5421631Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5421726Z def test_silu_mul_quant( 2025-05-07T20:32:35.5421813Z self, 2025-05-07T20:32:35.5421896Z T: int, 2025-05-07T20:32:35.5421975Z D: int, 2025-05-07T20:32:35.5422079Z scale_ub: Optional[float], 2025-05-07T20:32:35.5422170Z contiguous: bool, 2025-05-07T20:32:35.5422268Z compiled: bool, 2025-05-07T20:32:35.5422350Z ) -> None: 2025-05-07T20:32:35.5422446Z torch.manual_seed(2025) 2025-05-07T20:32:35.5422527Z 2025-05-07T20:32:35.5422698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5422772Z 2025-05-07T20:32:35.5422876Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5423005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5423096Z x = x_sign * x_clamp 2025-05-07T20:32:35.5423182Z x0 = x[:, :D] 2025-05-07T20:32:35.5423265Z x1 = x[:, D:] 2025-05-07T20:32:35.5423338Z 2025-05-07T20:32:35.5423485Z if contiguous: 2025-05-07T20:32:35.5423583Z x0 = x0.contiguous() 2025-05-07T20:32:35.5423679Z x1 = x1.contiguous() 2025-05-07T20:32:35.5423757Z 2025-05-07T20:32:35.5424431Z if scale_ub is not None: 2025-05-07T20:32:35.5424545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5424682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5424762Z ) 2025-05-07T20:32:35.5424848Z else: 2025-05-07T20:32:35.5424946Z scale_ub_tensor = None 2025-05-07T20:32:35.5425019Z 2025-05-07T20:32:35.5425153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5425245Z op = silu_mul_quant 2025-05-07T20:32:35.5425336Z if compiled: 2025-05-07T20:32:35.5425443Z op = torch.compile(op) 2025-05-07T20:32:35.5425550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5425627Z 2025-05-07T20:32:35.5425724Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5425846Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5425927Z 2025-05-07T20:32:35.5426113Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5426217Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5426324Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5426488Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5426632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5426712Z 2025-05-07T20:32:35.5426814Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5426818Z 2025-05-07T20:32:35.5426924Z moe/activation_test.py:126: 2025-05-07T20:32:35.5427096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5427202Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5427346Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5427928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5428033Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5428412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5428646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5429031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5429298Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5429694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5429872Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5430232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5430318Z fn() 2025-05-07T20:32:35.5430738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5430824Z self.fn.run( 2025-05-07T20:32:35.5431180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5431279Z kernel = self.compile( 2025-05-07T20:32:35.5431674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5431860Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5431998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5432002Z 2025-05-07T20:32:35.5432218Z self = 2025-05-07T20:32:35.5433073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5433601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8aee57c400>} 2025-05-07T20:32:35.5434388Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5434588Z context = 2025-05-07T20:32:35.5434595Z 2025-05-07T20:32:35.5434771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5435045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5435155Z module_map=module_map) 2025-05-07T20:32:35.5435367Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5435479Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5435563Z E ^ 2025-05-07T20:32:35.5435931Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5435975Z 2025-05-07T20:32:35.5436407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5436412Z 2025-05-07T20:32:35.5436525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5436797Z self=, 2025-05-07T20:32:35.5436882Z T=2048, 2025-05-07T20:32:35.5436959Z D=5120, 2025-05-07T20:32:35.5437043Z scale_ub=None, 2025-05-07T20:32:35.5437141Z contiguous=True, 2025-05-07T20:32:35.5437226Z compiled=True, 2025-05-07T20:32:35.5437301Z ) 2025-05-07T20:32:35.5437534Z self = 2025-05-07T20:32:35.5437715Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5437720Z 2025-05-07T20:32:35.5437800Z @given( 2025-05-07T20:32:35.5437930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5438036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5438161Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5438286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5438403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5438491Z ) 2025-05-07T20:32:35.5438745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5438842Z def test_silu_mul_quant( 2025-05-07T20:32:35.5438927Z self, 2025-05-07T20:32:35.5439009Z T: int, 2025-05-07T20:32:35.5439090Z D: int, 2025-05-07T20:32:35.5439196Z scale_ub: Optional[float], 2025-05-07T20:32:35.5439290Z contiguous: bool, 2025-05-07T20:32:35.5439380Z compiled: bool, 2025-05-07T20:32:35.5439468Z ) -> None: 2025-05-07T20:32:35.5439569Z torch.manual_seed(2025) 2025-05-07T20:32:35.5439646Z 2025-05-07T20:32:35.5439835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5439912Z 2025-05-07T20:32:35.5440012Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5440141Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5440234Z x = x_sign * x_clamp 2025-05-07T20:32:35.5440323Z x0 = x[:, :D] 2025-05-07T20:32:35.5440409Z x1 = x[:, D:] 2025-05-07T20:32:35.5440484Z 2025-05-07T20:32:35.5440577Z if contiguous: 2025-05-07T20:32:35.5440671Z x0 = x0.contiguous() 2025-05-07T20:32:35.5440813Z x1 = x1.contiguous() 2025-05-07T20:32:35.5440896Z 2025-05-07T20:32:35.5440990Z if scale_ub is not None: 2025-05-07T20:32:35.5441098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5441243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5441327Z ) 2025-05-07T20:32:35.5441416Z else: 2025-05-07T20:32:35.5441516Z scale_ub_tensor = None 2025-05-07T20:32:35.5441591Z 2025-05-07T20:32:35.5441728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5441821Z op = silu_mul_quant 2025-05-07T20:32:35.5441908Z if compiled: 2025-05-07T20:32:35.5442012Z op = torch.compile(op) 2025-05-07T20:32:35.5442118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5442196Z 2025-05-07T20:32:35.5442293Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5442417Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5442490Z 2025-05-07T20:32:35.5442637Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5442740Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5442894Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5443022Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5443167Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5443290Z 2025-05-07T20:32:35.5443393Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5443397Z 2025-05-07T20:32:35.5443497Z moe/activation_test.py:126: 2025-05-07T20:32:35.5443636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5443744Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5443927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5444507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5444610Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5444987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5445216Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5445593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5445864Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5446252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5446435Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5446791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5446868Z fn() 2025-05-07T20:32:35.5447290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5447378Z self.fn.run( 2025-05-07T20:32:35.5447727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5447830Z kernel = self.compile( 2025-05-07T20:32:35.5448226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5448412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5448544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5448552Z 2025-05-07T20:32:35.5448762Z self = 2025-05-07T20:32:35.5449647Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5450169Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac9d07ba0>} 2025-05-07T20:32:35.5450946Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5451144Z context = 2025-05-07T20:32:35.5451149Z 2025-05-07T20:32:35.5451323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5451599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5451708Z module_map=module_map) 2025-05-07T20:32:35.5451941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5452048Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5452127Z E ^ 2025-05-07T20:32:35.5452547Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5452552Z 2025-05-07T20:32:35.5452982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5453023Z 2025-05-07T20:32:35.5453136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5453366Z self=, 2025-05-07T20:32:35.5453444Z T=128, 2025-05-07T20:32:35.5453565Z D=5120, 2025-05-07T20:32:35.5453650Z scale_ub=None, 2025-05-07T20:32:35.5453735Z contiguous=True, 2025-05-07T20:32:35.5453827Z compiled=True, 2025-05-07T20:32:35.5453901Z ) 2025-05-07T20:32:35.5454130Z self = 2025-05-07T20:32:35.5454309Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5454313Z 2025-05-07T20:32:35.5454398Z @given( 2025-05-07T20:32:35.5454523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5454623Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5454738Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5454864Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5454978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5455054Z ) 2025-05-07T20:32:35.5455309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5455405Z def test_silu_mul_quant( 2025-05-07T20:32:35.5455489Z self, 2025-05-07T20:32:35.5455565Z T: int, 2025-05-07T20:32:35.5455643Z D: int, 2025-05-07T20:32:35.5455747Z scale_ub: Optional[float], 2025-05-07T20:32:35.5455838Z contiguous: bool, 2025-05-07T20:32:35.5455924Z compiled: bool, 2025-05-07T20:32:35.5456008Z ) -> None: 2025-05-07T20:32:35.5456110Z torch.manual_seed(2025) 2025-05-07T20:32:35.5456187Z 2025-05-07T20:32:35.5456364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5456439Z 2025-05-07T20:32:35.5456534Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5456667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5456756Z x = x_sign * x_clamp 2025-05-07T20:32:35.5456838Z x0 = x[:, :D] 2025-05-07T20:32:35.5456924Z x1 = x[:, D:] 2025-05-07T20:32:35.5456999Z 2025-05-07T20:32:35.5457089Z if contiguous: 2025-05-07T20:32:35.5457184Z x0 = x0.contiguous() 2025-05-07T20:32:35.5457276Z x1 = x1.contiguous() 2025-05-07T20:32:35.5457353Z 2025-05-07T20:32:35.5457444Z if scale_ub is not None: 2025-05-07T20:32:35.5457597Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5457744Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5457820Z ) 2025-05-07T20:32:35.5457899Z else: 2025-05-07T20:32:35.5457998Z scale_ub_tensor = None 2025-05-07T20:32:35.5458071Z 2025-05-07T20:32:35.5458200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5458302Z op = silu_mul_quant 2025-05-07T20:32:35.5458388Z if compiled: 2025-05-07T20:32:35.5458494Z op = torch.compile(op) 2025-05-07T20:32:35.5458600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5458672Z 2025-05-07T20:32:35.5458772Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5458897Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5458970Z 2025-05-07T20:32:35.5459112Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5459215Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5459317Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5459445Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5459632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5459710Z 2025-05-07T20:32:35.5459811Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5459816Z 2025-05-07T20:32:35.5459954Z moe/activation_test.py:126: 2025-05-07T20:32:35.5460092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5460195Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5460330Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5460914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5461058Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5461438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5461674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5462056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5462329Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5462722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5462895Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5463256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5463337Z fn() 2025-05-07T20:32:35.5463757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5463841Z self.fn.run( 2025-05-07T20:32:35.5464194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5464298Z kernel = self.compile( 2025-05-07T20:32:35.5464695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5464875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5465014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5465019Z 2025-05-07T20:32:35.5465233Z self = 2025-05-07T20:32:35.5466049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5466621Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac90c9300>} 2025-05-07T20:32:35.5467409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5467610Z context = 2025-05-07T20:32:35.5467614Z 2025-05-07T20:32:35.5467785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5468065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5468179Z module_map=module_map) 2025-05-07T20:32:35.5468351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5468458Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5468542Z E ^ 2025-05-07T20:32:35.5468918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5468922Z 2025-05-07T20:32:35.5469398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5469403Z 2025-05-07T20:32:35.5469510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5469786Z self=, 2025-05-07T20:32:35.5469868Z T=4096, 2025-05-07T20:32:35.5469952Z D=5120, 2025-05-07T20:32:35.5470038Z scale_ub=None, 2025-05-07T20:32:35.5470124Z contiguous=True, 2025-05-07T20:32:35.5470215Z compiled=True, 2025-05-07T20:32:35.5470333Z ) 2025-05-07T20:32:35.5470563Z self = 2025-05-07T20:32:35.5470750Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5470758Z 2025-05-07T20:32:35.5476489Z @given( 2025-05-07T20:32:35.5476630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5476745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5476859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5476981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5477095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5477173Z ) 2025-05-07T20:32:35.5477435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5477530Z def test_silu_mul_quant( 2025-05-07T20:32:35.5477605Z self, 2025-05-07T20:32:35.5477687Z T: int, 2025-05-07T20:32:35.5477767Z D: int, 2025-05-07T20:32:35.5477864Z scale_ub: Optional[float], 2025-05-07T20:32:35.5477958Z contiguous: bool, 2025-05-07T20:32:35.5478044Z compiled: bool, 2025-05-07T20:32:35.5478128Z ) -> None: 2025-05-07T20:32:35.5478224Z torch.manual_seed(2025) 2025-05-07T20:32:35.5478298Z 2025-05-07T20:32:35.5478477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5478555Z 2025-05-07T20:32:35.5478650Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5478783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5478872Z x = x_sign * x_clamp 2025-05-07T20:32:35.5478956Z x0 = x[:, :D] 2025-05-07T20:32:35.5479042Z x1 = x[:, D:] 2025-05-07T20:32:35.5479117Z 2025-05-07T20:32:35.5479201Z if contiguous: 2025-05-07T20:32:35.5479297Z x0 = x0.contiguous() 2025-05-07T20:32:35.5479387Z x1 = x1.contiguous() 2025-05-07T20:32:35.5479461Z 2025-05-07T20:32:35.5479563Z if scale_ub is not None: 2025-05-07T20:32:35.5479671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5479813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5479890Z ) 2025-05-07T20:32:35.5480036Z else: 2025-05-07T20:32:35.5480135Z scale_ub_tensor = None 2025-05-07T20:32:35.5480210Z 2025-05-07T20:32:35.5480346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5480442Z op = silu_mul_quant 2025-05-07T20:32:35.5480529Z if compiled: 2025-05-07T20:32:35.5480630Z op = torch.compile(op) 2025-05-07T20:32:35.5480741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5480816Z 2025-05-07T20:32:35.5480909Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5481034Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5481108Z 2025-05-07T20:32:35.5481251Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5481361Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5481461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5481590Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5481736Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5481810Z 2025-05-07T20:32:35.5481962Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.5481968Z 2025-05-07T20:32:35.5482071Z moe/activation_test.py:126: 2025-05-07T20:32:35.5482208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5482359Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5482497Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5483143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5483244Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5483686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5483924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5484310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5484582Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5484975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5485149Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5485507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5485583Z fn() 2025-05-07T20:32:35.5486000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5486089Z self.fn.run( 2025-05-07T20:32:35.5486442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5486543Z kernel = self.compile( 2025-05-07T20:32:35.5486943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5487123Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5487259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5487267Z 2025-05-07T20:32:35.5487479Z self = 2025-05-07T20:32:35.5488298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5488822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac902a340>} 2025-05-07T20:32:35.5489650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5489852Z context = 2025-05-07T20:32:35.5489857Z 2025-05-07T20:32:35.5490024Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5490301Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5490411Z module_map=module_map) 2025-05-07T20:32:35.5490577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5490685Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5490761Z E ^ 2025-05-07T20:32:35.5491132Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5491139Z 2025-05-07T20:32:35.5491565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5491613Z 2025-05-07T20:32:35.5491719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5492011Z self=, 2025-05-07T20:32:35.5492132Z T=16384, 2025-05-07T20:32:35.5492206Z D=5120, 2025-05-07T20:32:35.5492296Z scale_ub=None, 2025-05-07T20:32:35.5492378Z contiguous=True, 2025-05-07T20:32:35.5492466Z compiled=True, 2025-05-07T20:32:35.5492539Z ) 2025-05-07T20:32:35.5492765Z self = 2025-05-07T20:32:35.5492992Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.5492997Z 2025-05-07T20:32:35.5493073Z @given( 2025-05-07T20:32:35.5493195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5493311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5493427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5493547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5493666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5493742Z ) 2025-05-07T20:32:35.5494000Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5494099Z def test_silu_mul_quant( 2025-05-07T20:32:35.5494177Z self, 2025-05-07T20:32:35.5494264Z T: int, 2025-05-07T20:32:35.5494345Z D: int, 2025-05-07T20:32:35.5494446Z scale_ub: Optional[float], 2025-05-07T20:32:35.5494538Z contiguous: bool, 2025-05-07T20:32:35.5494627Z compiled: bool, 2025-05-07T20:32:35.5494706Z ) -> None: 2025-05-07T20:32:35.5494803Z torch.manual_seed(2025) 2025-05-07T20:32:35.5494877Z 2025-05-07T20:32:35.5495052Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5495130Z 2025-05-07T20:32:35.5495224Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5495357Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5495449Z x = x_sign * x_clamp 2025-05-07T20:32:35.5495530Z x0 = x[:, :D] 2025-05-07T20:32:35.5495615Z x1 = x[:, D:] 2025-05-07T20:32:35.5495686Z 2025-05-07T20:32:35.5495773Z if contiguous: 2025-05-07T20:32:35.5495870Z x0 = x0.contiguous() 2025-05-07T20:32:35.5495959Z x1 = x1.contiguous() 2025-05-07T20:32:35.5496033Z 2025-05-07T20:32:35.5496131Z if scale_ub is not None: 2025-05-07T20:32:35.5496237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5496378Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5496461Z ) 2025-05-07T20:32:35.5496539Z else: 2025-05-07T20:32:35.5496636Z scale_ub_tensor = None 2025-05-07T20:32:35.5496715Z 2025-05-07T20:32:35.5496892Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5496991Z op = silu_mul_quant 2025-05-07T20:32:35.5497076Z if compiled: 2025-05-07T20:32:35.5497179Z op = torch.compile(op) 2025-05-07T20:32:35.5497288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5497360Z 2025-05-07T20:32:35.5497451Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5497580Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5497655Z 2025-05-07T20:32:35.5497791Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5497900Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5498002Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5498132Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5498273Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5498348Z 2025-05-07T20:32:35.5498454Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.5498458Z 2025-05-07T20:32:35.5498560Z moe/activation_test.py:126: 2025-05-07T20:32:35.5498734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5498846Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.5498980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5499600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.5499710Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.5500078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5500349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5500727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.5500989Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.5501382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.5501551Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.5501909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.5501992Z fn() 2025-05-07T20:32:35.5502405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.5502497Z self.fn.run( 2025-05-07T20:32:35.5502880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5502996Z kernel = self.compile( 2025-05-07T20:32:35.5503393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5503573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5503715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5503719Z 2025-05-07T20:32:35.5503928Z self = 2025-05-07T20:32:35.5504737Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5505261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac87a6d40>} 2025-05-07T20:32:35.5506080Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5506543Z context = 2025-05-07T20:32:35.5506555Z 2025-05-07T20:32:35.5506782Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5507064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5507179Z module_map=module_map) 2025-05-07T20:32:35.5507348Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5507458Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.5507537Z E ^ 2025-05-07T20:32:35.5507909Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5507916Z 2025-05-07T20:32:35.5508355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5508360Z 2025-05-07T20:32:35.5508467Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5508793Z self=, 2025-05-07T20:32:35.5508874Z T=1, 2025-05-07T20:32:35.5508951Z D=5120, 2025-05-07T20:32:35.5509043Z scale_ub=1200.0, 2025-05-07T20:32:35.5509129Z contiguous=True, 2025-05-07T20:32:35.5509269Z compiled=True, 2025-05-07T20:32:35.5509348Z ) 2025-05-07T20:32:35.5509575Z self = 2025-05-07T20:32:35.5509746Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5509756Z 2025-05-07T20:32:35.5509836Z @given( 2025-05-07T20:32:35.5510019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5510125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5510244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5510367Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5510490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5510569Z ) 2025-05-07T20:32:35.5510825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5510925Z def test_silu_mul_quant( 2025-05-07T20:32:35.5511003Z self, 2025-05-07T20:32:35.5511085Z T: int, 2025-05-07T20:32:35.5511168Z D: int, 2025-05-07T20:32:35.5511269Z scale_ub: Optional[float], 2025-05-07T20:32:35.5511366Z contiguous: bool, 2025-05-07T20:32:35.5511454Z compiled: bool, 2025-05-07T20:32:35.5511537Z ) -> None: 2025-05-07T20:32:35.5511639Z torch.manual_seed(2025) 2025-05-07T20:32:35.5511718Z 2025-05-07T20:32:35.5511890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5511970Z 2025-05-07T20:32:35.5512064Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5512191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5512286Z x = x_sign * x_clamp 2025-05-07T20:32:35.5512369Z x0 = x[:, :D] 2025-05-07T20:32:35.5512454Z x1 = x[:, D:] 2025-05-07T20:32:35.5512535Z 2025-05-07T20:32:35.5512620Z if contiguous: 2025-05-07T20:32:35.5512714Z x0 = x0.contiguous() 2025-05-07T20:32:35.5512811Z x1 = x1.contiguous() 2025-05-07T20:32:35.5512887Z 2025-05-07T20:32:35.5512983Z if scale_ub is not None: 2025-05-07T20:32:35.5513090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5513229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5513310Z ) 2025-05-07T20:32:35.5513387Z else: 2025-05-07T20:32:35.5513482Z scale_ub_tensor = None 2025-05-07T20:32:35.5513558Z 2025-05-07T20:32:35.5513687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5513779Z op = silu_mul_quant 2025-05-07T20:32:35.5513937Z if compiled: 2025-05-07T20:32:35.5514039Z op = torch.compile(op) 2025-05-07T20:32:35.5514149Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5514232Z 2025-05-07T20:32:35.5514325Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5514329Z 2025-05-07T20:32:35.5514429Z moe/activation_test.py:117: 2025-05-07T20:32:35.5514561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5514666Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5514773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5515149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5515245Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5515757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5515860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5516228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5516499Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5516851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5517013Z kernel = self.compile( 2025-05-07T20:32:35.5517409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5517594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5517725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5517768Z 2025-05-07T20:32:35.5517977Z self = 2025-05-07T20:32:35.5518789Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5519311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac82a19e0>} 2025-05-07T20:32:35.5520091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5520324Z context = 2025-05-07T20:32:35.5520331Z 2025-05-07T20:32:35.5520567Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5520841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5520953Z module_map=module_map) 2025-05-07T20:32:35.5521121Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5521224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5521305Z E ^ 2025-05-07T20:32:35.5521673Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5521681Z 2025-05-07T20:32:35.5522108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5522113Z 2025-05-07T20:32:35.5522221Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5522448Z self=, 2025-05-07T20:32:35.5522528Z T=1, 2025-05-07T20:32:35.5522611Z D=5120, 2025-05-07T20:32:35.5522700Z scale_ub=None, 2025-05-07T20:32:35.5522787Z contiguous=False, 2025-05-07T20:32:35.5522885Z compiled=True, 2025-05-07T20:32:35.5522975Z ) 2025-05-07T20:32:35.5523377Z self = 2025-05-07T20:32:35.5523553Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5523558Z 2025-05-07T20:32:35.5523636Z @given( 2025-05-07T20:32:35.5523758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5523866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5523983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5524107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5524222Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5524298Z ) 2025-05-07T20:32:35.5524556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5524655Z def test_silu_mul_quant( 2025-05-07T20:32:35.5524733Z self, 2025-05-07T20:32:35.5524816Z T: int, 2025-05-07T20:32:35.5524894Z D: int, 2025-05-07T20:32:35.5524995Z scale_ub: Optional[float], 2025-05-07T20:32:35.5525089Z contiguous: bool, 2025-05-07T20:32:35.5525176Z compiled: bool, 2025-05-07T20:32:35.5525304Z ) -> None: 2025-05-07T20:32:35.5525401Z torch.manual_seed(2025) 2025-05-07T20:32:35.5525472Z 2025-05-07T20:32:35.5525646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5525761Z 2025-05-07T20:32:35.5525856Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5525988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5526078Z x = x_sign * x_clamp 2025-05-07T20:32:35.5526160Z x0 = x[:, :D] 2025-05-07T20:32:35.5526249Z x1 = x[:, D:] 2025-05-07T20:32:35.5526365Z 2025-05-07T20:32:35.5526450Z if contiguous: 2025-05-07T20:32:35.5526550Z x0 = x0.contiguous() 2025-05-07T20:32:35.5526645Z x1 = x1.contiguous() 2025-05-07T20:32:35.5526718Z 2025-05-07T20:32:35.5526818Z if scale_ub is not None: 2025-05-07T20:32:35.5526923Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5527062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5527143Z ) 2025-05-07T20:32:35.5527223Z else: 2025-05-07T20:32:35.5527323Z scale_ub_tensor = None 2025-05-07T20:32:35.5527396Z 2025-05-07T20:32:35.5527524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5527623Z op = silu_mul_quant 2025-05-07T20:32:35.5527710Z if compiled: 2025-05-07T20:32:35.5527810Z op = torch.compile(op) 2025-05-07T20:32:35.5527922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5527995Z 2025-05-07T20:32:35.5528090Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.5528217Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.5528288Z 2025-05-07T20:32:35.5528432Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5528537Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.5528639Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.5528769Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.5528912Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.5528988Z 2025-05-07T20:32:35.5529097Z > y_fp8_ref, 
y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8ac82a0680>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
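Note on the failure mode: every example dies with the same ValueError("type fp8e4nv not supported in this architecture. ..."), raised while Triton lowers the kernel to TTIR. fp8e4nv is Triton's e4m3 float8 type, and the error text says this CUDA backend only accepts 'fp8e4b15' and 'fp8e5' here; this job runs on a g5 instance (A10G, sm_86), which is consistent with the cast being rejected on a pre-sm_89 part. A minimal, self-contained sketch that should reproduce the same CompilationError on such a GPU follows; the kernel is hypothetical (not from the test suite) and assumes Triton 3.x dtype names plus PyTorch's torch.float8_e4m3fn:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Hypothetical repro kernel: the .to(tl.float8e4nv) cast is what trips
    # Triton's architecture check on pre-sm_89 GPUs.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

N = 1024
x = torch.randn(N, device="cuda", dtype=torch.float32)
y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
# Expected on sm_86: triton.compiler.errors.CompilationError wrapping
# ValueError("type fp8e4nv not supported in this architecture. ...")
_cast_fp8e4nv_kernel[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)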
Hypothesis then drew further examples; each failed at the same point with the identical CompilationError from _fbgemm_silu_mul_quant. The test source and traceback are unchanged from the block above, so only the parameter draws are listed here:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

For the compiled=True draws the traceback additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678 (in _fn, return fn(*args, **kwargs)) before reaching activation.py:80; torch.compile still dispatches to the same Triton kernel, so the failure is unchanged.
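Since every draw dies on the same architecture check, the whole property could be guarded up front instead of letting Hypothesis grind through failing examples. A sketch of such a guard, assuming unittest-style test classes as in this file; the helper name supports_fp8e4nv, the class placement, and the (8, 9) capability cutoff are assumptions, not FBGEMM API:

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) cast needs compute capability
    # >= (8, 9) (Ada/Hopper); the A10G on g5 runners reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv support")
class ActivationTests(unittest.TestCase):  # hypothetical placement
    # test_silu_mul_quant and related fp8 tests would live here unchanged.
    ...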
The next draw got further: with T=1, D=7168, scale_ub=None, contiguous=False, compiled=True, the failure was reported from the reference path rather than from fn(), i.e. the same ValueError now came out of _kernel_quantize_fp8_row via triton_quantize_fp8_row:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(autotuner and compiler frames identical to the first block above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
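The reference path quantizes the silu-mul product row-wise to fp8 with triton_quantize_fp8_row, which also JIT-compiles a Triton kernel and therefore hits the same architecture check. For debugging numerics on a GPU where that kernel cannot compile, a rough pure-PyTorch stand-in could be used; the following is a sketch of assumed rowwise absmax-scaling semantics (chosen to match the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None]), not FBGEMM's actual implementation:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: per-row absmax scaling into the e4m3 range.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
    y_scale = row_max / fp8_max                # multiplier applied at dequant
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale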
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5651721Z 2025-05-07T20:32:35.5652232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5652240Z 2025-05-07T20:32:35.5652344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5652576Z self=, 2025-05-07T20:32:35.5652653Z T=1, 2025-05-07T20:32:35.5652731Z D=5120, 2025-05-07T20:32:35.5652816Z scale_ub=1200.0, 2025-05-07T20:32:35.5652902Z contiguous=False, 2025-05-07T20:32:35.5652985Z compiled=True, 2025-05-07T20:32:35.5653061Z ) 2025-05-07T20:32:35.5653284Z self = 2025-05-07T20:32:35.5653486Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5653491Z 2025-05-07T20:32:35.5653584Z @given( 2025-05-07T20:32:35.5653713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5653817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5653931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5654094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5654213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5654285Z ) 2025-05-07T20:32:35.5654537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5654672Z def test_silu_mul_quant( 2025-05-07T20:32:35.5654748Z self, 2025-05-07T20:32:35.5654826Z T: int, 2025-05-07T20:32:35.5654906Z D: int, 2025-05-07T20:32:35.5655003Z scale_ub: Optional[float], 2025-05-07T20:32:35.5655090Z contiguous: bool, 2025-05-07T20:32:35.5655295Z compiled: bool, 2025-05-07T20:32:35.5655373Z ) -> None: 2025-05-07T20:32:35.5655469Z torch.manual_seed(2025) 2025-05-07T20:32:35.5655540Z 2025-05-07T20:32:35.5655714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5655792Z 2025-05-07T20:32:35.5655888Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5656011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5656104Z x = x_sign * x_clamp 2025-05-07T20:32:35.5656183Z x0 = x[:, :D] 2025-05-07T20:32:35.5656267Z x1 = x[:, D:] 2025-05-07T20:32:35.5656342Z 2025-05-07T20:32:35.5656428Z if contiguous: 2025-05-07T20:32:35.5656518Z x0 = x0.contiguous() 2025-05-07T20:32:35.5656622Z x1 = x1.contiguous() 2025-05-07T20:32:35.5656723Z 2025-05-07T20:32:35.5656850Z if scale_ub is not None: 2025-05-07T20:32:35.5656988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5657125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5657207Z ) 2025-05-07T20:32:35.5657283Z else: 2025-05-07T20:32:35.5657378Z scale_ub_tensor = None 2025-05-07T20:32:35.5657455Z 2025-05-07T20:32:35.5657588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5657677Z op = silu_mul_quant 2025-05-07T20:32:35.5657764Z if compiled: 2025-05-07T20:32:35.5657865Z op = torch.compile(op) 2025-05-07T20:32:35.5657968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5658041Z 2025-05-07T20:32:35.5658136Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5658143Z 2025-05-07T20:32:35.5658243Z moe/activation_test.py:117: 2025-05-07T20:32:35.5658375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5658475Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5658576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5658958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5659051Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5659622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5659721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5660097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5660324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5660679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5660795Z kernel = self.compile( 2025-05-07T20:32:35.5661327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5661511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5661645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5661650Z 2025-05-07T20:32:35.5661860Z self = 2025-05-07T20:32:35.5662733Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5663257Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8131e40>} 2025-05-07T20:32:35.5664081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5664316Z context = 2025-05-07T20:32:35.5664320Z 2025-05-07T20:32:35.5664490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5664770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5664877Z module_map=module_map) 2025-05-07T20:32:35.5665040Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5665145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5665220Z E ^ 2025-05-07T20:32:35.5665599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5665603Z 2025-05-07T20:32:35.5666038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5666042Z 2025-05-07T20:32:35.5666147Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5666382Z self=, 2025-05-07T20:32:35.5666458Z T=1, 2025-05-07T20:32:35.5666536Z D=5120, 2025-05-07T20:32:35.5666623Z scale_ub=1200.0, 2025-05-07T20:32:35.5666708Z contiguous=False, 2025-05-07T20:32:35.5666798Z compiled=False, 2025-05-07T20:32:35.5666869Z ) 2025-05-07T20:32:35.5667100Z self = 2025-05-07T20:32:35.5667282Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.5667286Z 2025-05-07T20:32:35.5667366Z @given( 2025-05-07T20:32:35.5667484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5667584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5667697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5667813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5667933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5668006Z ) 2025-05-07T20:32:35.5668263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5668401Z def test_silu_mul_quant( 2025-05-07T20:32:35.5668477Z self, 2025-05-07T20:32:35.5668562Z T: int, 2025-05-07T20:32:35.5668637Z D: int, 2025-05-07T20:32:35.5668737Z scale_ub: Optional[float], 2025-05-07T20:32:35.5668828Z contiguous: bool, 2025-05-07T20:32:35.5668912Z compiled: bool, 2025-05-07T20:32:35.5668987Z ) -> None: 2025-05-07T20:32:35.5669085Z torch.manual_seed(2025) 2025-05-07T20:32:35.5669160Z 2025-05-07T20:32:35.5669330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5669406Z 2025-05-07T20:32:35.5669497Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5669625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5669714Z x = x_sign * x_clamp 2025-05-07T20:32:35.5669796Z x0 = x[:, :D] 2025-05-07T20:32:35.5669877Z x1 = x[:, D:] 2025-05-07T20:32:35.5669948Z 2025-05-07T20:32:35.5670030Z if contiguous: 2025-05-07T20:32:35.5670127Z x0 = x0.contiguous() 2025-05-07T20:32:35.5670215Z x1 = x1.contiguous() 2025-05-07T20:32:35.5670284Z 2025-05-07T20:32:35.5670376Z if scale_ub is not None: 2025-05-07T20:32:35.5670524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5670663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5670745Z ) 2025-05-07T20:32:35.5670859Z else: 2025-05-07T20:32:35.5670952Z scale_ub_tensor = None 2025-05-07T20:32:35.5671027Z 2025-05-07T20:32:35.5671154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5671246Z op = silu_mul_quant 2025-05-07T20:32:35.5671329Z if compiled: 2025-05-07T20:32:35.5671427Z op = torch.compile(op) 2025-05-07T20:32:35.5672300Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5672373Z 2025-05-07T20:32:35.5672462Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5672466Z 2025-05-07T20:32:35.5672578Z moe/activation_test.py:117: 2025-05-07T20:32:35.5672728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5672854Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5672956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5673478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5673587Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5673960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5674190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5674556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5674652Z kernel = self.compile( 2025-05-07T20:32:35.5675055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5675234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5675372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5675376Z 2025-05-07T20:32:35.5675591Z self = 2025-05-07T20:32:35.5676405Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5676933Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8132ac0>} 2025-05-07T20:32:35.5677759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5677958Z context = 2025-05-07T20:32:35.5677963Z 2025-05-07T20:32:35.5678138Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5678411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5678524Z module_map=module_map) 2025-05-07T20:32:35.5678688Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5678788Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5678867Z E ^ 2025-05-07T20:32:35.5679238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5679246Z 2025-05-07T20:32:35.5679681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5679688Z 2025-05-07T20:32:35.5679790Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5680087Z self=, 2025-05-07T20:32:35.5680176Z T=16384, 2025-05-07T20:32:35.5680254Z D=5120, 2025-05-07T20:32:35.5680340Z scale_ub=1200.0, 2025-05-07T20:32:35.5680429Z contiguous=False, 2025-05-07T20:32:35.5680551Z compiled=True, 2025-05-07T20:32:35.5680625Z ) 2025-05-07T20:32:35.5680855Z self = 2025-05-07T20:32:35.5681043Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5681047Z 2025-05-07T20:32:35.5681166Z @given( 2025-05-07T20:32:35.5681287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5681387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5681508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5681630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5681745Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5681822Z ) 2025-05-07T20:32:35.5682077Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5682170Z def test_silu_mul_quant( 2025-05-07T20:32:35.5682251Z self, 2025-05-07T20:32:35.5682332Z T: int, 2025-05-07T20:32:35.5682407Z D: int, 2025-05-07T20:32:35.5682510Z scale_ub: Optional[float], 2025-05-07T20:32:35.5682599Z contiguous: bool, 2025-05-07T20:32:35.5682711Z compiled: bool, 2025-05-07T20:32:35.5682794Z ) -> None: 2025-05-07T20:32:35.5682913Z torch.manual_seed(2025) 2025-05-07T20:32:35.5682995Z 2025-05-07T20:32:35.5683167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5683242Z 2025-05-07T20:32:35.5683339Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5683467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5683557Z x = x_sign * x_clamp 2025-05-07T20:32:35.5683640Z x0 = x[:, :D] 2025-05-07T20:32:35.5683721Z x1 = x[:, D:] 2025-05-07T20:32:35.5683792Z 2025-05-07T20:32:35.5683879Z if contiguous: 2025-05-07T20:32:35.5683971Z x0 = x0.contiguous() 2025-05-07T20:32:35.5684060Z x1 = x1.contiguous() 2025-05-07T20:32:35.5684133Z 2025-05-07T20:32:35.5684223Z if scale_ub is not None: 2025-05-07T20:32:35.5684333Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5684470Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5684544Z ) 2025-05-07T20:32:35.5684623Z else: 2025-05-07T20:32:35.5684720Z scale_ub_tensor = None 2025-05-07T20:32:35.5684790Z 2025-05-07T20:32:35.5684920Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5685010Z op = silu_mul_quant 2025-05-07T20:32:35.5685140Z if compiled: 2025-05-07T20:32:35.5685247Z op = torch.compile(op) 2025-05-07T20:32:35.5685354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5685426Z 2025-05-07T20:32:35.5685515Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5685520Z 2025-05-07T20:32:35.5685617Z moe/activation_test.py:117: 2025-05-07T20:32:35.5685753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5685855Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5685955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5686334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5686427Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5686943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5687041Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5687410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5687685Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5688035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5688168Z kernel = self.compile( 2025-05-07T20:32:35.5688570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5688745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5688876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5688920Z 2025-05-07T20:32:35.5689128Z self = 2025-05-07T20:32:35.5689937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5690462Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917e4c180>} 2025-05-07T20:32:35.5691239Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5691433Z context = 2025-05-07T20:32:35.5691440Z 2025-05-07T20:32:35.5691605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5691940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5692055Z module_map=module_map) 2025-05-07T20:32:35.5692215Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5692321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5692396Z E ^ 2025-05-07T20:32:35.5692762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5693203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
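[Editor's note: every example fails at the same point, before the kernel body ever runs: Triton rejects the fp8e4nv (FP8 E4M3) element type while lowering _fbgemm_silu_mul_quant. fp8e4nv is only lowered on NVIDIA GPUs with compute capability >= 8.9 (Ada/Hopper); the A10G on this g5 runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal capability guard of the kind a test could use is sketched below; the helper name is hypothetical and not part of the FBGEMM test suite.

    import pytest
    import torch

    def require_fp8_e4m3() -> None:
        # A10G reports (8, 6); Triton's fp8e4nv needs (8, 9) or newer.
        if not torch.cuda.is_available():
            pytest.skip("CUDA device required")
        if torch.cuda.get_device_capability() < (8, 9):
            pytest.skip("Triton fp8e4nv (FP8 E4M3) requires SM >= 8.9")
]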
[The identical test body and CompilationError traceback were then printed verbatim for each further example Hypothesis tried:

    Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
    Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each attempt ended with:

    E   triton.compiler.errors.CompilationError: at 1:0:
    E   def _fbgemm_silu_mul_quant(
    E   ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

    /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
]
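[Editor's note: the failure is independent of Hypothesis and of torch.compile — it reproduces with compiled=False, where the torch/_dynamo/eval_frame.py frame is simply absent from the traceback. A standalone repro can be assembled from the test body above; silu_mul_quant and its module path are taken from the traceback, the input construction mirrors the test, and the snippet is a sketch rather than an official repro script.

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    # Raises triton.compiler.errors.CompilationError on GPUs where Triton
    # cannot lower fp8e4nv (e.g. SM 8.6):
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)
]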
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5829497Z 2025-05-07T20:32:35.5829935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5829939Z 2025-05-07T20:32:35.5830083Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5830318Z self=, 2025-05-07T20:32:35.5830394Z T=1, 2025-05-07T20:32:35.5830473Z D=7168, 2025-05-07T20:32:35.5830557Z scale_ub=None, 2025-05-07T20:32:35.5830644Z contiguous=False, 2025-05-07T20:32:35.5830727Z compiled=False, 2025-05-07T20:32:35.5830804Z ) 2025-05-07T20:32:35.5831030Z self = 2025-05-07T20:32:35.5831200Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.5831207Z 2025-05-07T20:32:35.5831283Z @given( 2025-05-07T20:32:35.5831401Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5831504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5831618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5831734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5831853Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5831925Z ) 2025-05-07T20:32:35.5832179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5832316Z def test_silu_mul_quant( 2025-05-07T20:32:35.5832392Z self, 2025-05-07T20:32:35.5832467Z T: int, 2025-05-07T20:32:35.5832547Z D: int, 2025-05-07T20:32:35.5832682Z scale_ub: Optional[float], 2025-05-07T20:32:35.5832773Z contiguous: bool, 2025-05-07T20:32:35.5832857Z compiled: bool, 2025-05-07T20:32:35.5832934Z ) -> None: 2025-05-07T20:32:35.5833031Z torch.manual_seed(2025) 2025-05-07T20:32:35.5833103Z 2025-05-07T20:32:35.5833280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5833414Z 2025-05-07T20:32:35.5833527Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5833654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5833750Z x = x_sign * x_clamp 2025-05-07T20:32:35.5833828Z x0 = x[:, :D] 2025-05-07T20:32:35.5833904Z x1 = x[:, D:] 2025-05-07T20:32:35.5833980Z 2025-05-07T20:32:35.5834063Z if contiguous: 2025-05-07T20:32:35.5834155Z x0 = x0.contiguous() 2025-05-07T20:32:35.5834241Z x1 = x1.contiguous() 2025-05-07T20:32:35.5834310Z 2025-05-07T20:32:35.5834400Z if scale_ub is not None: 2025-05-07T20:32:35.5834509Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5834645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5834722Z ) 2025-05-07T20:32:35.5834797Z else: 2025-05-07T20:32:35.5834890Z scale_ub_tensor = None 2025-05-07T20:32:35.5834964Z 2025-05-07T20:32:35.5835095Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5835183Z op = silu_mul_quant 2025-05-07T20:32:35.5835268Z if compiled: 2025-05-07T20:32:35.5835368Z op = torch.compile(op) 2025-05-07T20:32:35.5835474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5835545Z 2025-05-07T20:32:35.5835635Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5835642Z 2025-05-07T20:32:35.5835742Z moe/activation_test.py:117: 2025-05-07T20:32:35.5835871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5835971Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5836076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5836596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5836692Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5837071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5837300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5837730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5837825Z kernel = self.compile( 2025-05-07T20:32:35.5838227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5838407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5838537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5838542Z 2025-05-07T20:32:35.5838751Z self = 2025-05-07T20:32:35.5839573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5840103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917a57100>} 2025-05-07T20:32:35.5840933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5841128Z context = 2025-05-07T20:32:35.5841168Z 2025-05-07T20:32:35.5841342Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5841614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5841721Z module_map=module_map) 2025-05-07T20:32:35.5841924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5842023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5842103Z E ^ 2025-05-07T20:32:35.5842475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5842480Z 2025-05-07T20:32:35.5842917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5842922Z 2025-05-07T20:32:35.5843029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5843263Z self=, 2025-05-07T20:32:35.5843344Z T=2048, 2025-05-07T20:32:35.5843439Z D=7168, 2025-05-07T20:32:35.5843526Z scale_ub=None, 2025-05-07T20:32:35.5843639Z contiguous=False, 2025-05-07T20:32:35.5843724Z compiled=True, 2025-05-07T20:32:35.5843796Z ) 2025-05-07T20:32:35.5844029Z self = 2025-05-07T20:32:35.5844207Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5844211Z 2025-05-07T20:32:35.5844287Z @given( 2025-05-07T20:32:35.5844411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5844509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5844625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5844744Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5844856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5844933Z ) 2025-05-07T20:32:35.5845187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5845280Z def test_silu_mul_quant( 2025-05-07T20:32:35.5845359Z self, 2025-05-07T20:32:35.5845434Z T: int, 2025-05-07T20:32:35.5845509Z D: int, 2025-05-07T20:32:35.5845610Z scale_ub: Optional[float], 2025-05-07T20:32:35.5845702Z contiguous: bool, 2025-05-07T20:32:35.5845787Z compiled: bool, 2025-05-07T20:32:35.5845866Z ) -> None: 2025-05-07T20:32:35.5845959Z torch.manual_seed(2025) 2025-05-07T20:32:35.5846076Z 2025-05-07T20:32:35.5846250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5846324Z 2025-05-07T20:32:35.5846422Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5846547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5846634Z x = x_sign * x_clamp 2025-05-07T20:32:35.5846715Z x0 = x[:, :D] 2025-05-07T20:32:35.5846795Z x1 = x[:, D:] 2025-05-07T20:32:35.5846866Z 2025-05-07T20:32:35.5846951Z if contiguous: 2025-05-07T20:32:35.5847040Z x0 = x0.contiguous() 2025-05-07T20:32:35.5847127Z x1 = x1.contiguous() 2025-05-07T20:32:35.5847202Z 2025-05-07T20:32:35.5847291Z if scale_ub is not None: 2025-05-07T20:32:35.5847397Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5847534Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5847608Z ) 2025-05-07T20:32:35.5847687Z else: 2025-05-07T20:32:35.5847782Z scale_ub_tensor = None 2025-05-07T20:32:35.5847852Z 2025-05-07T20:32:35.5847982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5848115Z op = silu_mul_quant 2025-05-07T20:32:35.5848200Z if compiled: 2025-05-07T20:32:35.5848301Z op = torch.compile(op) 2025-05-07T20:32:35.5848404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5848513Z 2025-05-07T20:32:35.5848606Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5848610Z 2025-05-07T20:32:35.5848705Z moe/activation_test.py:117: 2025-05-07T20:32:35.5848839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5848938Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5849075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5849459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5849553Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5850069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5850169Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5850545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5850781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5851133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5851226Z kernel = self.compile( 2025-05-07T20:32:35.5851630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5851856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5851991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5851996Z 2025-05-07T20:32:35.5852208Z self = 2025-05-07T20:32:35.5853029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5853563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b44720>} 2025-05-07T20:32:35.5854352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5854554Z context = 2025-05-07T20:32:35.5854558Z 2025-05-07T20:32:35.5854776Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5858599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5858783Z module_map=module_map) 2025-05-07T20:32:35.5859005Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5859113Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5859196Z E ^ 2025-05-07T20:32:35.5859632Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5859638Z 2025-05-07T20:32:35.5860075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5860084Z 2025-05-07T20:32:35.5860186Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5860415Z self=, 2025-05-07T20:32:35.5860498Z T=4096, 2025-05-07T20:32:35.5860575Z D=7168, 2025-05-07T20:32:35.5860656Z scale_ub=None, 2025-05-07T20:32:35.5860746Z contiguous=False, 2025-05-07T20:32:35.5860908Z compiled=True, 2025-05-07T20:32:35.5860986Z ) 2025-05-07T20:32:35.5861216Z self = 2025-05-07T20:32:35.5861391Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5861437Z 2025-05-07T20:32:35.5861515Z @given( 2025-05-07T20:32:35.5861636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5861734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5861850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5862006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5862122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5862196Z ) 2025-05-07T20:32:35.5862456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5862548Z def test_silu_mul_quant( 2025-05-07T20:32:35.5862629Z self, 2025-05-07T20:32:35.5862707Z T: int, 2025-05-07T20:32:35.5862784Z D: int, 2025-05-07T20:32:35.5862881Z scale_ub: Optional[float], 2025-05-07T20:32:35.5862968Z contiguous: bool, 2025-05-07T20:32:35.5863054Z compiled: bool, 2025-05-07T20:32:35.5863134Z ) -> None: 2025-05-07T20:32:35.5863226Z torch.manual_seed(2025) 2025-05-07T20:32:35.5863299Z 2025-05-07T20:32:35.5863472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5863546Z 2025-05-07T20:32:35.5863640Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5863767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5863854Z x = x_sign * x_clamp 2025-05-07T20:32:35.5863935Z x0 = x[:, :D] 2025-05-07T20:32:35.5864016Z x1 = x[:, D:] 2025-05-07T20:32:35.5864092Z 2025-05-07T20:32:35.5864174Z if contiguous: 2025-05-07T20:32:35.5864261Z x0 = x0.contiguous() 2025-05-07T20:32:35.5864353Z x1 = x1.contiguous() 2025-05-07T20:32:35.5864428Z 2025-05-07T20:32:35.5864518Z if scale_ub is not None: 2025-05-07T20:32:35.5864626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5864762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5864838Z ) 2025-05-07T20:32:35.5864916Z else: 2025-05-07T20:32:35.5865007Z scale_ub_tensor = None 2025-05-07T20:32:35.5865078Z 2025-05-07T20:32:35.5865209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5865298Z op = silu_mul_quant 2025-05-07T20:32:35.5865384Z if compiled: 2025-05-07T20:32:35.5865489Z op = torch.compile(op) 2025-05-07T20:32:35.5865593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5865668Z 2025-05-07T20:32:35.5865806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5865811Z 2025-05-07T20:32:35.5865910Z moe/activation_test.py:117: 2025-05-07T20:32:35.5866052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5866152Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5866250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5866635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5866732Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5867244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5867344Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5867711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5867947Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5868298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5868433Z kernel = self.compile( 2025-05-07T20:32:35.5868901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5869098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5869278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5869283Z 2025-05-07T20:32:35.5869512Z self = 2025-05-07T20:32:35.5870488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5871079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b45440>} 2025-05-07T20:32:35.5871854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5872059Z context = 2025-05-07T20:32:35.5872064Z 2025-05-07T20:32:35.5872232Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5872502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5872610Z module_map=module_map) 2025-05-07T20:32:35.5872772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5872875Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5872954Z E ^ 2025-05-07T20:32:35.5873318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5873323Z 2025-05-07T20:32:35.5873756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5873760Z 2025-05-07T20:32:35.5873862Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5874117Z self=, 2025-05-07T20:32:35.5874224Z T=16384, 2025-05-07T20:32:35.5874334Z D=5120, 2025-05-07T20:32:35.5874431Z scale_ub=1200.0, 2025-05-07T20:32:35.5874519Z contiguous=False, 2025-05-07T20:32:35.5874602Z compiled=False, 2025-05-07T20:32:35.5874680Z ) 2025-05-07T20:32:35.5874906Z self = 2025-05-07T20:32:35.5875092Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.5875156Z 2025-05-07T20:32:35.5875234Z @given( 2025-05-07T20:32:35.5875355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5875458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5875574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5875690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5875812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5875885Z ) 2025-05-07T20:32:35.5876138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5876235Z def test_silu_mul_quant( 2025-05-07T20:32:35.5876312Z self, 2025-05-07T20:32:35.5876391Z T: int, 2025-05-07T20:32:35.5876471Z D: int, 2025-05-07T20:32:35.5876566Z scale_ub: Optional[float], 2025-05-07T20:32:35.5876656Z contiguous: bool, 2025-05-07T20:32:35.5876741Z compiled: bool, 2025-05-07T20:32:35.5876818Z ) -> None: 2025-05-07T20:32:35.5876917Z torch.manual_seed(2025) 2025-05-07T20:32:35.5876986Z 2025-05-07T20:32:35.5877155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5877278Z 2025-05-07T20:32:35.5877372Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5877496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5877585Z x = x_sign * x_clamp 2025-05-07T20:32:35.5877703Z x0 = x[:, :D] 2025-05-07T20:32:35.5877782Z x1 = x[:, D:] 2025-05-07T20:32:35.5877856Z 2025-05-07T20:32:35.5877940Z if contiguous: 2025-05-07T20:32:35.5878034Z x0 = x0.contiguous() 2025-05-07T20:32:35.5878122Z x1 = x1.contiguous() 2025-05-07T20:32:35.5878234Z 2025-05-07T20:32:35.5878325Z if scale_ub is not None: 2025-05-07T20:32:35.5878429Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5878564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5878647Z ) 2025-05-07T20:32:35.5878723Z else: 2025-05-07T20:32:35.5878816Z scale_ub_tensor = None 2025-05-07T20:32:35.5878892Z 2025-05-07T20:32:35.5879022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5879112Z op = silu_mul_quant 2025-05-07T20:32:35.5879197Z if compiled: 2025-05-07T20:32:35.5879296Z op = torch.compile(op) 2025-05-07T20:32:35.5879401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5879475Z 2025-05-07T20:32:35.5879563Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5879568Z 2025-05-07T20:32:35.5879665Z moe/activation_test.py:117: 2025-05-07T20:32:35.5879796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5879897Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5879998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5880515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.5880611Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5880982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5881209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5881562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5881654Z kernel = self.compile( 2025-05-07T20:32:35.5882046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5882229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5882361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5882365Z 2025-05-07T20:32:35.5882621Z self = 2025-05-07T20:32:35.5883427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5883946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b46340>} 2025-05-07T20:32:35.5884723Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5884922Z context = 2025-05-07T20:32:35.5884926Z 2025-05-07T20:32:35.5885098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5885368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5885473Z module_map=module_map) 2025-05-07T20:32:35.5885680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5885780Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5885859Z E ^ 2025-05-07T20:32:35.5886222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5886264Z 2025-05-07T20:32:35.5886691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5886696Z 2025-05-07T20:32:35.5886802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5887076Z self=, 2025-05-07T20:32:35.5887153Z T=16384, 2025-05-07T20:32:35.5887229Z D=5120, 2025-05-07T20:32:35.5887312Z scale_ub=1200.0, 2025-05-07T20:32:35.5887402Z contiguous=True, 2025-05-07T20:32:35.5887484Z compiled=True, 2025-05-07T20:32:35.5887556Z ) 2025-05-07T20:32:35.5887784Z self = 2025-05-07T20:32:35.5887965Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5887969Z 2025-05-07T20:32:35.5888044Z @given( 2025-05-07T20:32:35.5888171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5888271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5888384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5888501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5888614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5888692Z ) 2025-05-07T20:32:35.5888940Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5889033Z def test_silu_mul_quant( 2025-05-07T20:32:35.5889112Z self, 2025-05-07T20:32:35.5889189Z T: int, 2025-05-07T20:32:35.5889265Z D: int, 2025-05-07T20:32:35.5889363Z scale_ub: Optional[float], 2025-05-07T20:32:35.5889454Z contiguous: bool, 2025-05-07T20:32:35.5889540Z compiled: bool, 2025-05-07T20:32:35.5889618Z ) -> None: 2025-05-07T20:32:35.5889710Z torch.manual_seed(2025) 2025-05-07T20:32:35.5889783Z 2025-05-07T20:32:35.5889955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5890029Z 2025-05-07T20:32:35.5890124Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5890247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5890333Z x = x_sign * x_clamp 2025-05-07T20:32:35.5890417Z x0 = x[:, :D] 2025-05-07T20:32:35.5890495Z x1 = x[:, D:] 2025-05-07T20:32:35.5890566Z 2025-05-07T20:32:35.5890653Z if contiguous: 2025-05-07T20:32:35.5890742Z x0 = x0.contiguous() 2025-05-07T20:32:35.5890879Z x1 = x1.contiguous() 2025-05-07T20:32:35.5890956Z 2025-05-07T20:32:35.5891045Z if scale_ub is not None: 2025-05-07T20:32:35.5891151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5891293Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5891368Z ) 2025-05-07T20:32:35.5891449Z else: 2025-05-07T20:32:35.5891543Z scale_ub_tensor = None 2025-05-07T20:32:35.5891613Z 2025-05-07T20:32:35.5891743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5891925Z op = silu_mul_quant 2025-05-07T20:32:35.5892010Z if compiled: 2025-05-07T20:32:35.5892111Z op = torch.compile(op) 2025-05-07T20:32:35.5892217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5892287Z 2025-05-07T20:32:35.5892381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5892386Z 2025-05-07T20:32:35.5892480Z moe/activation_test.py:117: 2025-05-07T20:32:35.5892611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5892713Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5892858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5893238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5893372Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5893878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5893978Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5894344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5894610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5894961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5895055Z kernel = self.compile( 2025-05-07T20:32:35.5895456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5895634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5895763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5895769Z 2025-05-07T20:32:35.5895980Z self = 2025-05-07T20:32:35.5896782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5897311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8ac8b479c0>} 2025-05-07T20:32:35.5898084Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5898281Z context = 2025-05-07T20:32:35.5898287Z 2025-05-07T20:32:35.5898453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5898723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5898832Z module_map=module_map) 2025-05-07T20:32:35.5898999Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5899098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5899176Z E ^ 2025-05-07T20:32:35.5899586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5899591Z 2025-05-07T20:32:35.5900028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5900032Z 2025-05-07T20:32:35.5900135Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5900362Z self=, 2025-05-07T20:32:35.5900445Z T=16384, 2025-05-07T20:32:35.5900523Z D=5120, 2025-05-07T20:32:35.5900605Z scale_ub=None, 2025-05-07T20:32:35.5900695Z contiguous=False, 2025-05-07T20:32:35.5900778Z compiled=True, 2025-05-07T20:32:35.5900851Z ) 2025-05-07T20:32:35.5901075Z self = 2025-05-07T20:32:35.5901259Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5901263Z 2025-05-07T20:32:35.5901342Z @given( 2025-05-07T20:32:35.5901465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5901563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5901680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5901841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5901957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5902029Z ) 2025-05-07T20:32:35.5902280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5902455Z def test_silu_mul_quant( 2025-05-07T20:32:35.5902533Z self, 2025-05-07T20:32:35.5902607Z T: int, 2025-05-07T20:32:35.5902685Z D: int, 2025-05-07T20:32:35.5902782Z scale_ub: Optional[float], 2025-05-07T20:32:35.5902870Z contiguous: bool, 2025-05-07T20:32:35.5903001Z compiled: bool, 2025-05-07T20:32:35.5903078Z ) -> None: 2025-05-07T20:32:35.5903171Z torch.manual_seed(2025) 2025-05-07T20:32:35.5903246Z 2025-05-07T20:32:35.5903417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5903490Z 2025-05-07T20:32:35.5903585Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5903710Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5903802Z x = x_sign * x_clamp 2025-05-07T20:32:35.5903881Z x0 = x[:, :D] 2025-05-07T20:32:35.5903956Z x1 = x[:, D:] 2025-05-07T20:32:35.5904035Z 2025-05-07T20:32:35.5904119Z if contiguous: 2025-05-07T20:32:35.5904207Z x0 = x0.contiguous() 2025-05-07T20:32:35.5904297Z x1 = x1.contiguous() 2025-05-07T20:32:35.5904368Z 2025-05-07T20:32:35.5904456Z if scale_ub is not None: 2025-05-07T20:32:35.5904561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5904698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5904772Z ) 2025-05-07T20:32:35.5904852Z else: 2025-05-07T20:32:35.5904947Z scale_ub_tensor = None 2025-05-07T20:32:35.5905025Z 2025-05-07T20:32:35.5905154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5905245Z op = silu_mul_quant 2025-05-07T20:32:35.5905335Z if compiled: 2025-05-07T20:32:35.5905433Z op = torch.compile(op) 2025-05-07T20:32:35.5905537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5905611Z 2025-05-07T20:32:35.5905703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5905707Z 2025-05-07T20:32:35.5905802Z moe/activation_test.py:117: 2025-05-07T20:32:35.5905937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5906035Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5906314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5906841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5906968Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5907581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5907684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5908053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5908285Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5908636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5908734Z kernel = self.compile( 2025-05-07T20:32:35.5909129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5909311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5909445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5909452Z 2025-05-07T20:32:35.5909659Z self = 2025-05-07T20:32:35.5910537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5911109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7cc20>} 2025-05-07T20:32:35.5911881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5912138Z context = 2025-05-07T20:32:35.5912142Z 2025-05-07T20:32:35.5912314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5912589Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5912698Z module_map=module_map) 2025-05-07T20:32:35.5912860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5912961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5913039Z E ^ 2025-05-07T20:32:35.5913404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5913412Z 2025-05-07T20:32:35.5913838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5913845Z 2025-05-07T20:32:35.5913948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5914177Z self=, 2025-05-07T20:32:35.5914254Z T=2048, 2025-05-07T20:32:35.5914333Z D=5120, 2025-05-07T20:32:35.5914418Z scale_ub=None, 2025-05-07T20:32:35.5914504Z contiguous=False, 2025-05-07T20:32:35.5914586Z compiled=True, 2025-05-07T20:32:35.5914664Z ) 2025-05-07T20:32:35.5914888Z self = 2025-05-07T20:32:35.5915068Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.5915075Z 2025-05-07T20:32:35.5915151Z @given( 2025-05-07T20:32:35.5915270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5915371Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5915485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5915602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5915720Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5915792Z ) 2025-05-07T20:32:35.5916046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5916187Z def test_silu_mul_quant( 2025-05-07T20:32:35.5916264Z self, 2025-05-07T20:32:35.5916342Z T: int, 2025-05-07T20:32:35.5916418Z D: int, 2025-05-07T20:32:35.5916516Z scale_ub: Optional[float], 2025-05-07T20:32:35.5916608Z contiguous: bool, 2025-05-07T20:32:35.5916693Z compiled: bool, 2025-05-07T20:32:35.5916770Z ) -> None: 2025-05-07T20:32:35.5916869Z torch.manual_seed(2025) 2025-05-07T20:32:35.5916940Z 2025-05-07T20:32:35.5917108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5917185Z 2025-05-07T20:32:35.5917277Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5917401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5917493Z x = x_sign * x_clamp 2025-05-07T20:32:35.5917572Z x0 = x[:, :D] 2025-05-07T20:32:35.5917653Z x1 = x[:, D:] 2025-05-07T20:32:35.5917725Z 2025-05-07T20:32:35.5917805Z if contiguous: 2025-05-07T20:32:35.5917901Z x0 = x0.contiguous() 2025-05-07T20:32:35.5917994Z x1 = x1.contiguous() 2025-05-07T20:32:35.5918066Z 2025-05-07T20:32:35.5918203Z if scale_ub is not None: 2025-05-07T20:32:35.5918308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5918443Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5918562Z ) 2025-05-07T20:32:35.5918641Z else: 2025-05-07T20:32:35.5918738Z scale_ub_tensor = None 2025-05-07T20:32:35.5918814Z 2025-05-07T20:32:35.5918942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5919031Z op = silu_mul_quant 2025-05-07T20:32:35.5919118Z if compiled: 2025-05-07T20:32:35.5919259Z op = torch.compile(op) 2025-05-07T20:32:35.5919365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5919435Z 2025-05-07T20:32:35.5919525Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5919532Z 2025-05-07T20:32:35.5919629Z moe/activation_test.py:117: 2025-05-07T20:32:35.5919760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5919864Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5919964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5920341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5920436Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5920949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5921045Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5921422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5921648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5921999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5922095Z kernel = self.compile( 2025-05-07T20:32:35.5922491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5922696Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5922848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5922853Z 2025-05-07T20:32:35.5923066Z self = 2025-05-07T20:32:35.5923870Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5924436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7d9e0>} 2025-05-07T20:32:35.5925220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5925412Z context = 2025-05-07T20:32:35.5925419Z 2025-05-07T20:32:35.5925586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5925857Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5925962Z module_map=module_map) 2025-05-07T20:32:35.5926130Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5926228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5926305Z E ^ 2025-05-07T20:32:35.5926674Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5926679Z 2025-05-07T20:32:35.5927231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5927236Z 2025-05-07T20:32:35.5927343Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5927611Z self=, 2025-05-07T20:32:35.5927689Z T=2048, 2025-05-07T20:32:35.5927766Z D=5120, 2025-05-07T20:32:35.5927848Z scale_ub=1200.0, 2025-05-07T20:32:35.5927931Z contiguous=False, 2025-05-07T20:32:35.5928018Z compiled=True, 2025-05-07T20:32:35.5928088Z ) 2025-05-07T20:32:35.5928353Z self = 2025-05-07T20:32:35.5928536Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5928541Z 2025-05-07T20:32:35.5928620Z @given( 2025-05-07T20:32:35.5928741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5928841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5928955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5929073Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5929185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5929259Z ) 2025-05-07T20:32:35.5929512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5929605Z def test_silu_mul_quant( 2025-05-07T20:32:35.5929681Z self, 2025-05-07T20:32:35.5929761Z T: int, 2025-05-07T20:32:35.5929836Z D: int, 2025-05-07T20:32:35.5929936Z scale_ub: Optional[float], 2025-05-07T20:32:35.5930028Z contiguous: bool, 2025-05-07T20:32:35.5930113Z compiled: bool, 2025-05-07T20:32:35.5930194Z ) -> None: 2025-05-07T20:32:35.5930286Z torch.manual_seed(2025) 2025-05-07T20:32:35.5930357Z 2025-05-07T20:32:35.5930527Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5930602Z 2025-05-07T20:32:35.5930696Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5930823Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5930910Z x = x_sign * x_clamp 2025-05-07T20:32:35.5930986Z x0 = x[:, :D] 2025-05-07T20:32:35.5931069Z x1 = x[:, D:] 2025-05-07T20:32:35.5931139Z 2025-05-07T20:32:35.5931220Z if contiguous: 2025-05-07T20:32:35.5931313Z x0 = x0.contiguous() 2025-05-07T20:32:35.5931400Z x1 = x1.contiguous() 2025-05-07T20:32:35.5931475Z 2025-05-07T20:32:35.5931565Z if scale_ub is not None: 2025-05-07T20:32:35.5931672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5931863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5931939Z ) 2025-05-07T20:32:35.5932013Z else: 2025-05-07T20:32:35.5932159Z scale_ub_tensor = None 2025-05-07T20:32:35.5932234Z 2025-05-07T20:32:35.5932363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5932458Z op = silu_mul_quant 2025-05-07T20:32:35.5932541Z if compiled: 2025-05-07T20:32:35.5932638Z op = torch.compile(op) 2025-05-07T20:32:35.5932749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5932822Z 2025-05-07T20:32:35.5932917Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5932921Z 2025-05-07T20:32:35.5933018Z moe/activation_test.py:117: 2025-05-07T20:32:35.5933149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5933253Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5933356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5933734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5933832Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5934411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5934510Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5934879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5935146Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5935504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5935595Z kernel = self.compile( 2025-05-07T20:32:35.5935990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5936209Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5936342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5936346Z 2025-05-07T20:32:35.5936560Z self = 2025-05-07T20:32:35.5937365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5937887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c7eb60>} 2025-05-07T20:32:35.5938665Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5938860Z context = 2025-05-07T20:32:35.5938867Z 2025-05-07T20:32:35.5939038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5939310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5939421Z module_map=module_map) 2025-05-07T20:32:35.5939583Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5939682Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5939759Z E ^ 2025-05-07T20:32:35.5940123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5940127Z 2025-05-07T20:32:35.5940554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5940561Z 2025-05-07T20:32:35.5940666Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5940937Z self=, 2025-05-07T20:32:35.5941018Z T=4096, 2025-05-07T20:32:35.5941092Z D=5120, 2025-05-07T20:32:35.5941175Z scale_ub=1200.0, 2025-05-07T20:32:35.5941264Z contiguous=True, 2025-05-07T20:32:35.5941345Z compiled=True, 2025-05-07T20:32:35.5941416Z ) 2025-05-07T20:32:35.5941643Z self = 2025-05-07T20:32:35.5941820Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5941825Z 2025-05-07T20:32:35.5941903Z @given( 2025-05-07T20:32:35.5942025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5942123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5942239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5942357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5942471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5942551Z ) 2025-05-07T20:32:35.5942850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5942941Z def test_silu_mul_quant( 2025-05-07T20:32:35.5943019Z self, 2025-05-07T20:32:35.5943137Z T: int, 2025-05-07T20:32:35.5943215Z D: int, 2025-05-07T20:32:35.5943316Z scale_ub: Optional[float], 2025-05-07T20:32:35.5943403Z contiguous: bool, 2025-05-07T20:32:35.5943525Z compiled: bool, 2025-05-07T20:32:35.5943604Z ) -> None: 2025-05-07T20:32:35.5943699Z torch.manual_seed(2025) 2025-05-07T20:32:35.5943771Z 2025-05-07T20:32:35.5943939Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5944011Z 2025-05-07T20:32:35.5944106Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5944269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5944355Z x = x_sign * x_clamp 2025-05-07T20:32:35.5944436Z x0 = x[:, :D] 2025-05-07T20:32:35.5944515Z x1 = x[:, D:] 2025-05-07T20:32:35.5944590Z 2025-05-07T20:32:35.5944677Z if contiguous: 2025-05-07T20:32:35.5944765Z x0 = x0.contiguous() 2025-05-07T20:32:35.5944852Z x1 = x1.contiguous() 2025-05-07T20:32:35.5944926Z 2025-05-07T20:32:35.5945014Z if scale_ub is not None: 2025-05-07T20:32:35.5945122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5945257Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5945334Z ) 2025-05-07T20:32:35.5945418Z else: 2025-05-07T20:32:35.5945509Z scale_ub_tensor = None 2025-05-07T20:32:35.5945579Z 2025-05-07T20:32:35.5945709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5945798Z op = silu_mul_quant 2025-05-07T20:32:35.5945884Z if compiled: 2025-05-07T20:32:35.5945984Z op = torch.compile(op) 2025-05-07T20:32:35.5946087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5946158Z 2025-05-07T20:32:35.5946254Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5946258Z 2025-05-07T20:32:35.5946354Z moe/activation_test.py:117: 2025-05-07T20:32:35.5946489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5946588Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5946685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5947070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5947163Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5947670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5947774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5948140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5948417Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5948766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5948862Z kernel = self.compile( 2025-05-07T20:32:35.5949260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5949439Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5949568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5949576Z 2025-05-07T20:32:35.5949783Z self = 2025-05-07T20:32:35.5950587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5951113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917880180>} 2025-05-07T20:32:35.5951926Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5952161Z context = 2025-05-07T20:32:35.5952166Z 2025-05-07T20:32:35.5952331Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5952598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5952747Z module_map=module_map) 2025-05-07T20:32:35.5952909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5953009Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5953088Z E ^ 2025-05-07T20:32:35.5953452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5953458Z 2025-05-07T20:32:35.5953888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5953892Z 2025-05-07T20:32:35.5953999Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5954226Z self=, 2025-05-07T20:32:35.5954306Z T=128, 2025-05-07T20:32:35.5954381Z D=5120, 2025-05-07T20:32:35.5954468Z scale_ub=1200.0, 2025-05-07T20:32:35.5954553Z contiguous=False, 2025-05-07T20:32:35.5954642Z compiled=True, 2025-05-07T20:32:35.5954715Z ) 2025-05-07T20:32:35.5954938Z self = 2025-05-07T20:32:35.5955115Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5955120Z 2025-05-07T20:32:35.5955200Z @given( 2025-05-07T20:32:35.5955317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5955418Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5955534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5955652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5955775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5955847Z ) 2025-05-07T20:32:35.5956100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5956195Z def test_silu_mul_quant( 2025-05-07T20:32:35.5956270Z self, 2025-05-07T20:32:35.5956345Z T: int, 2025-05-07T20:32:35.5956425Z D: int, 2025-05-07T20:32:35.5956521Z scale_ub: Optional[float], 2025-05-07T20:32:35.5956609Z contiguous: bool, 2025-05-07T20:32:35.5956696Z compiled: bool, 2025-05-07T20:32:35.5956773Z ) -> None: 2025-05-07T20:32:35.5956911Z torch.manual_seed(2025) 2025-05-07T20:32:35.5956989Z 2025-05-07T20:32:35.5957159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5957233Z 2025-05-07T20:32:35.5957324Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5957448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5957538Z x = x_sign * x_clamp 2025-05-07T20:32:35.5957619Z x0 = x[:, :D] 2025-05-07T20:32:35.5957694Z x1 = x[:, D:] 2025-05-07T20:32:35.5957769Z 2025-05-07T20:32:35.5957852Z if contiguous: 2025-05-07T20:32:35.5957941Z x0 = x0.contiguous() 2025-05-07T20:32:35.5958031Z x1 = x1.contiguous() 2025-05-07T20:32:35.5958109Z 2025-05-07T20:32:35.5958197Z if scale_ub is not None: 2025-05-07T20:32:35.5958306Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5958441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5958520Z ) 2025-05-07T20:32:35.5958595Z else: 2025-05-07T20:32:35.5958688Z scale_ub_tensor = None 2025-05-07T20:32:35.5958763Z 2025-05-07T20:32:35.5958934Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5959025Z op = silu_mul_quant 2025-05-07T20:32:35.5959109Z if compiled: 2025-05-07T20:32:35.5959206Z op = torch.compile(op) 2025-05-07T20:32:35.5959347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5959421Z 2025-05-07T20:32:35.5959510Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5959514Z 2025-05-07T20:32:35.5959611Z moe/activation_test.py:117: 2025-05-07T20:32:35.5959743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5959880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5959984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5960364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5960455Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5960968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8917880ea0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
[... identical Triton traceback and CompilationError ("type fp8e4nv not supported in this architecture") as above ...]
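Every CompilationError in this run bottoms out in the same ValueError: Triton's fp8e4nv type (float8_e4m3fn) is only lowered natively on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). The 22.07 GiB device reported below only offers 'fp8e4b15' and 'fp8e5', consistent with a pre-Ada part such as the A10G (sm_86). A minimal guard that would skip rather than fail these cases on older parts, sketched here with a hypothetical helper name (nothing below is taken from the test file):

import torch

def device_supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9; an A10G reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(0) >= (8, 9)

# Applied to the test above, e.g.:
#   @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")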
Hypothesis keeps drawing new examples; each one repeats the test body and the Triton traceback above verbatim, so only the drawn parameters and the outcome are shown from here on:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> CompilationError (same error)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError (same error)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError (same error)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError (same error)

The next example fails in a new way: the input allocation itself no longer fits on the device.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
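The allocator hint at the end of that message can be applied without touching the test, though it only helps with fragmentation, and here just 45.02 MiB is reserved-but-unallocated against 21.60 GiB of live allocations. A sketch of applying it anyway (the conftest.py placement is an assumption; the variable must be set before the first CUDA allocation):

import os

# Read by PyTorch's caching allocator when CUDA is first initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")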
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB; 28.44 MiB free
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 140.44 MiB free
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB; 28.44 MiB free
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB; 28.44 MiB free
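The failed allocation sizes are not arbitrary: each is exactly one bfloat16 intermediate of shape [T, 2 * D] from the test body (torch.randn, torch.sign, and torch.clamp each materialize such a tensor, at 2 bytes per element). A quick check against the OutOfMemoryError lines above:

for T, D in [(16384, 7168), (16384, 5120), (4096, 7168), (2048, 7168)]:
    mib = T * (2 * D) * 2 / 2**20  # elements x 2 bytes per bfloat16
    print(f"T={T:>5}, D={D}: {mib:.2f} MiB")
# -> 448.00, 320.00, 112.00, and 56.00 MiB, matching the failed allocations.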
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6078869Z 2025-05-07T20:32:35.6078993Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.6078998Z 2025-05-07T20:32:35.6079098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6079365Z self=, 2025-05-07T20:32:35.6079444Z T=1, 2025-05-07T20:32:35.6079521Z D=7168, 2025-05-07T20:32:35.6079603Z scale_ub=1200.0, 2025-05-07T20:32:35.6079688Z contiguous=True, 2025-05-07T20:32:35.6079811Z compiled=False, 2025-05-07T20:32:35.6079886Z ) 2025-05-07T20:32:35.6080111Z self = 2025-05-07T20:32:35.6080280Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6080285Z 2025-05-07T20:32:35.6080362Z @given( 2025-05-07T20:32:35.6080520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6080617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6080735Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6080854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6080967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6081041Z ) 2025-05-07T20:32:35.6081291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6081388Z def test_silu_mul_quant( 2025-05-07T20:32:35.6081463Z self, 2025-05-07T20:32:35.6081539Z T: int, 2025-05-07T20:32:35.6081621Z D: int, 2025-05-07T20:32:35.6081718Z scale_ub: Optional[float], 2025-05-07T20:32:35.6081810Z contiguous: bool, 2025-05-07T20:32:35.6081898Z compiled: bool, 2025-05-07T20:32:35.6081975Z ) -> None: 2025-05-07T20:32:35.6082069Z torch.manual_seed(2025) 2025-05-07T20:32:35.6082148Z 2025-05-07T20:32:35.6082316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6082389Z 2025-05-07T20:32:35.6082481Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6082605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6082699Z x = x_sign * x_clamp 2025-05-07T20:32:35.6082790Z x0 = x[:, :D] 2025-05-07T20:32:35.6082882Z x1 = x[:, D:] 2025-05-07T20:32:35.6082970Z 2025-05-07T20:32:35.6083067Z if contiguous: 2025-05-07T20:32:35.6083156Z x0 = x0.contiguous() 2025-05-07T20:32:35.6083245Z x1 = x1.contiguous() 2025-05-07T20:32:35.6083320Z 2025-05-07T20:32:35.6083411Z if scale_ub is not None: 2025-05-07T20:32:35.6083517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6083651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6083726Z ) 2025-05-07T20:32:35.6083804Z else: 2025-05-07T20:32:35.6083900Z scale_ub_tensor = None 2025-05-07T20:32:35.6083970Z 2025-05-07T20:32:35.6084101Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6084191Z op = silu_mul_quant 2025-05-07T20:32:35.6084325Z if compiled: 2025-05-07T20:32:35.6084425Z op = torch.compile(op) 2025-05-07T20:32:35.6084529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6084607Z 2025-05-07T20:32:35.6084697Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6084702Z 2025-05-07T20:32:35.6084797Z moe/activation_test.py:117: 2025-05-07T20:32:35.6084936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6085037Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6085134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6085652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6085750Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6086125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6086356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6086708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6086850Z kernel = self.compile( 2025-05-07T20:32:35.6087247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6087466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6087596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6087601Z 2025-05-07T20:32:35.6087807Z self = 2025-05-07T20:32:35.6088619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6089176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917552520>} 2025-05-07T20:32:35.6089958Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6090156Z context = 2025-05-07T20:32:35.6090161Z 2025-05-07T20:32:35.6090330Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6090606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6090714Z module_map=module_map) 2025-05-07T20:32:35.6090882Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6090981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6091058Z E ^ 2025-05-07T20:32:35.6091434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6091441Z 2025-05-07T20:32:35.6091951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6091956Z 2025-05-07T20:32:35.6092067Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6092296Z self=, 2025-05-07T20:32:35.6092377Z T=128, 2025-05-07T20:32:35.6092455Z D=5120, 2025-05-07T20:32:35.6092536Z scale_ub=None, 2025-05-07T20:32:35.6092619Z contiguous=True, 2025-05-07T20:32:35.6092704Z compiled=False, 2025-05-07T20:32:35.6092778Z ) 2025-05-07T20:32:35.6093004Z self = 2025-05-07T20:32:35.6093183Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6093258Z 2025-05-07T20:32:35.6093335Z @given( 2025-05-07T20:32:35.6093455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6093557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6093672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6093791Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6093908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6093980Z ) 2025-05-07T20:32:35.6094234Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6094327Z def test_silu_mul_quant( 2025-05-07T20:32:35.6094404Z self, 2025-05-07T20:32:35.6094483Z T: int, 2025-05-07T20:32:35.6094562Z D: int, 2025-05-07T20:32:35.6094658Z scale_ub: Optional[float], 2025-05-07T20:32:35.6094748Z contiguous: bool, 2025-05-07T20:32:35.6094832Z compiled: bool, 2025-05-07T20:32:35.6094917Z ) -> None: 2025-05-07T20:32:35.6095014Z torch.manual_seed(2025) 2025-05-07T20:32:35.6095086Z 2025-05-07T20:32:35.6095262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6095378Z 2025-05-07T20:32:35.6095472Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6095599Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6095685Z x = x_sign * x_clamp 2025-05-07T20:32:35.6095803Z x0 = x[:, :D] 2025-05-07T20:32:35.6095885Z x1 = x[:, D:] 2025-05-07T20:32:35.6095955Z 2025-05-07T20:32:35.6096036Z if contiguous: 2025-05-07T20:32:35.6096129Z x0 = x0.contiguous() 2025-05-07T20:32:35.6096217Z x1 = x1.contiguous() 2025-05-07T20:32:35.6096332Z 2025-05-07T20:32:35.6096420Z if scale_ub is not None: 2025-05-07T20:32:35.6096524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6096663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6096741Z ) 2025-05-07T20:32:35.6096817Z else: 2025-05-07T20:32:35.6096912Z scale_ub_tensor = None 2025-05-07T20:32:35.6096983Z 2025-05-07T20:32:35.6097113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6097206Z op = silu_mul_quant 2025-05-07T20:32:35.6097289Z if compiled: 2025-05-07T20:32:35.6097389Z op = torch.compile(op) 2025-05-07T20:32:35.6097500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6097570Z 2025-05-07T20:32:35.6097661Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6097665Z 2025-05-07T20:32:35.6097761Z moe/activation_test.py:117: 2025-05-07T20:32:35.6097893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6097996Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6098095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6098617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6098717Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6099091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6099324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6099685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6099778Z kernel = self.compile( 2025-05-07T20:32:35.6100179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6100357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6100489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6100496Z 2025-05-07T20:32:35.6100750Z self = 2025-05-07T20:32:35.6101581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6102110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917553420>} 2025-05-07T20:32:35.6102900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6103101Z context = 2025-05-07T20:32:35.6103106Z 2025-05-07T20:32:35.6103276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6103551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6103660Z module_map=module_map) 2025-05-07T20:32:35.6103863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6103963Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6104041Z E ^ 2025-05-07T20:32:35.6104412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6104454Z 2025-05-07T20:32:35.6104892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6104896Z 2025-05-07T20:32:35.6104998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6105270Z self=, 2025-05-07T20:32:35.6105352Z T=128, 2025-05-07T20:32:35.6105426Z D=7168, 2025-05-07T20:32:35.6105511Z scale_ub=None, 2025-05-07T20:32:35.6105598Z contiguous=True, 2025-05-07T20:32:35.6105681Z compiled=False, 2025-05-07T20:32:35.6105759Z ) 2025-05-07T20:32:35.6105987Z self = 2025-05-07T20:32:35.6106435Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6106443Z 2025-05-07T20:32:35.6106560Z @given( 2025-05-07T20:32:35.6106689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6106788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6106909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6110314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6110452Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6110537Z ) 2025-05-07T20:32:35.6110796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6110892Z def test_silu_mul_quant( 2025-05-07T20:32:35.6110970Z self, 2025-05-07T20:32:35.6111047Z T: int, 2025-05-07T20:32:35.6111128Z D: int, 2025-05-07T20:32:35.6111224Z scale_ub: Optional[float], 2025-05-07T20:32:35.6111317Z contiguous: bool, 2025-05-07T20:32:35.6111406Z compiled: bool, 2025-05-07T20:32:35.6111485Z ) -> None: 2025-05-07T20:32:35.6111582Z torch.manual_seed(2025) 2025-05-07T20:32:35.6111657Z 2025-05-07T20:32:35.6111835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6111913Z 2025-05-07T20:32:35.6112005Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6112130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6112220Z x = x_sign * x_clamp 2025-05-07T20:32:35.6112301Z x0 = x[:, :D] 2025-05-07T20:32:35.6112379Z x1 = x[:, D:] 2025-05-07T20:32:35.6112455Z 2025-05-07T20:32:35.6112537Z if contiguous: 2025-05-07T20:32:35.6112630Z x0 = x0.contiguous() 2025-05-07T20:32:35.6112851Z x1 = x1.contiguous() 2025-05-07T20:32:35.6112935Z 2025-05-07T20:32:35.6113040Z if scale_ub is not None: 2025-05-07T20:32:35.6113152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6113287Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6113364Z ) 2025-05-07T20:32:35.6113439Z else: 2025-05-07T20:32:35.6113535Z scale_ub_tensor = None 2025-05-07T20:32:35.6113608Z 2025-05-07T20:32:35.6113737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6113825Z op = silu_mul_quant 2025-05-07T20:32:35.6113911Z if compiled: 2025-05-07T20:32:35.6114009Z op = torch.compile(op) 2025-05-07T20:32:35.6114119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6114194Z 2025-05-07T20:32:35.6114283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6114288Z 2025-05-07T20:32:35.6114389Z moe/activation_test.py:117: 2025-05-07T20:32:35.6114524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6114623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6114789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6115310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6115461Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6115837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6116065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6116419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6116576Z kernel = self.compile( 2025-05-07T20:32:35.6116973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6117153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6117290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6117294Z 2025-05-07T20:32:35.6117501Z self = 2025-05-07T20:32:35.6118305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6118822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172984a0>} 2025-05-07T20:32:35.6119603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6119796Z context = 2025-05-07T20:32:35.6119803Z 2025-05-07T20:32:35.6119973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6120242Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6120352Z module_map=module_map) 2025-05-07T20:32:35.6120518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6120616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6120691Z E ^ 2025-05-07T20:32:35.6121057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6121064Z 2025-05-07T20:32:35.6121535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6121540Z 2025-05-07T20:32:35.6121648Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6121877Z self=, 2025-05-07T20:32:35.6121952Z T=2048, 2025-05-07T20:32:35.6122032Z D=7168, 2025-05-07T20:32:35.6122114Z scale_ub=1200.0, 2025-05-07T20:32:35.6122198Z contiguous=True, 2025-05-07T20:32:35.6122286Z compiled=False, 2025-05-07T20:32:35.6122360Z ) 2025-05-07T20:32:35.6122587Z self = 2025-05-07T20:32:35.6122780Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6122786Z 2025-05-07T20:32:35.6122871Z @given( 2025-05-07T20:32:35.6123019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6123117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6123231Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6123354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6123469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6123543Z ) 2025-05-07T20:32:35.6123839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6123934Z def test_silu_mul_quant( 2025-05-07T20:32:35.6124014Z self, 2025-05-07T20:32:35.6124091Z T: int, 2025-05-07T20:32:35.6124205Z D: int, 2025-05-07T20:32:35.6124305Z scale_ub: Optional[float], 2025-05-07T20:32:35.6124394Z contiguous: bool, 2025-05-07T20:32:35.6124479Z compiled: bool, 2025-05-07T20:32:35.6124563Z ) -> None: 2025-05-07T20:32:35.6124657Z torch.manual_seed(2025) 2025-05-07T20:32:35.6124797Z 2025-05-07T20:32:35.6124969Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6126830Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
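[NOTE] The allocator hint that ends the message above can be acted on at the job level. A minimal sketch, assuming the variable is set before CUDA is first initialized (for example at the very top of the test entry point); setting it after the first allocation has no effect:

    import os

    # Must be in place before the first CUDA allocation is made.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported only after the allocator config is set

Expandable segments only reduce fragmentation; they cannot help once the device is genuinely full, and with 26.44 MiB free out of 22.07 GiB this GPU is effectively exhausted by earlier examples.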
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6126839Z 2025-05-07T20:32:35.6126960Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6126965Z 2025-05-07T20:32:35.6127064Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6127296Z self=, 2025-05-07T20:32:35.6127377Z T=1, 2025-05-07T20:32:35.6127453Z D=5120, 2025-05-07T20:32:35.6127540Z scale_ub=1200.0, 2025-05-07T20:32:35.6127624Z contiguous=True, 2025-05-07T20:32:35.6127707Z compiled=False, 2025-05-07T20:32:35.6127785Z ) 2025-05-07T20:32:35.6128012Z self = 2025-05-07T20:32:35.6128183Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6128188Z 2025-05-07T20:32:35.6128269Z @given( 2025-05-07T20:32:35.6128387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6128491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6128605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6128721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6128840Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6128912Z ) 2025-05-07T20:32:35.6129161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6129260Z def test_silu_mul_quant( 2025-05-07T20:32:35.6129337Z self, 2025-05-07T20:32:35.6129413Z T: int, 2025-05-07T20:32:35.6129492Z D: int, 2025-05-07T20:32:35.6129636Z scale_ub: Optional[float], 2025-05-07T20:32:35.6129728Z contiguous: bool, 2025-05-07T20:32:35.6129818Z compiled: bool, 2025-05-07T20:32:35.6129897Z ) -> None: 2025-05-07T20:32:35.6129994Z torch.manual_seed(2025) 2025-05-07T20:32:35.6130068Z 2025-05-07T20:32:35.6130236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6130316Z 2025-05-07T20:32:35.6130408Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6130532Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6130623Z x = x_sign * x_clamp 2025-05-07T20:32:35.6130703Z x0 = x[:, :D] 2025-05-07T20:32:35.6130782Z x1 = x[:, D:] 2025-05-07T20:32:35.6130862Z 2025-05-07T20:32:35.6130947Z if contiguous: 2025-05-07T20:32:35.6131038Z x0 = x0.contiguous() 2025-05-07T20:32:35.6131131Z x1 = x1.contiguous() 2025-05-07T20:32:35.6131205Z 2025-05-07T20:32:35.6131301Z if scale_ub is not None: 2025-05-07T20:32:35.6131407Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6131543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6131665Z ) 2025-05-07T20:32:35.6131742Z else: 2025-05-07T20:32:35.6131921Z scale_ub_tensor = None 2025-05-07T20:32:35.6131999Z 2025-05-07T20:32:35.6132130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6132269Z op = silu_mul_quant 2025-05-07T20:32:35.6132358Z if compiled: 2025-05-07T20:32:35.6132456Z op = torch.compile(op) 2025-05-07T20:32:35.6132561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6132637Z 2025-05-07T20:32:35.6132769Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6132774Z 2025-05-07T20:32:35.6132873Z moe/activation_test.py:117: 2025-05-07T20:32:35.6133003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6133106Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6133211Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6133730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6133826Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6134197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6134427Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6134779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6134875Z kernel = self.compile( 2025-05-07T20:32:35.6135271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6135454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6135585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6135589Z 2025-05-07T20:32:35.6135797Z self = 2025-05-07T20:32:35.6136605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6137125Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917299a80>} 2025-05-07T20:32:35.6137902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6138139Z context = 2025-05-07T20:32:35.6138144Z 2025-05-07T20:32:35.6138317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6138590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6138696Z module_map=module_map) 2025-05-07T20:32:35.6138864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6138964Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6139039Z E ^ 2025-05-07T20:32:35.6139405Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6139410Z 2025-05-07T20:32:35.6139838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6139843Z 2025-05-07T20:32:35.6139948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6140179Z self=, 2025-05-07T20:32:35.6140257Z T=2048, 2025-05-07T20:32:35.6140337Z D=5120, 2025-05-07T20:32:35.6140458Z scale_ub=None, 2025-05-07T20:32:35.6140545Z contiguous=True, 2025-05-07T20:32:35.6140633Z compiled=False, 2025-05-07T20:32:35.6140704Z ) 2025-05-07T20:32:35.6140931Z self = 2025-05-07T20:32:35.6141148Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6141152Z 2025-05-07T20:32:35.6141226Z @given( 2025-05-07T20:32:35.6141350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6141447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6141600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6141721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6141832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6141911Z ) 2025-05-07T20:32:35.6142161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6142254Z def test_silu_mul_quant( 2025-05-07T20:32:35.6142332Z self, 2025-05-07T20:32:35.6142407Z T: int, 2025-05-07T20:32:35.6142482Z D: int, 2025-05-07T20:32:35.6142582Z scale_ub: Optional[float], 2025-05-07T20:32:35.6142674Z contiguous: bool, 2025-05-07T20:32:35.6142759Z compiled: bool, 2025-05-07T20:32:35.6142838Z ) -> None: 2025-05-07T20:32:35.6142931Z torch.manual_seed(2025) 2025-05-07T20:32:35.6143002Z 2025-05-07T20:32:35.6143173Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6143250Z 2025-05-07T20:32:35.6143365Z > x_sign = torch.sign(x) 2025-05-07T20:32:35.6145258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
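[NOTE] The recurring CompilationError above is an architecture limit, not a flaky build: Triton's fp8e4nv corresponds to float8_e4m3fn and requires compute capability 8.9 or newer, while this runner's ~22 GiB device is consistent with an Ampere-class GPU (SM 8.6) that only offers fp8e4b15 and fp8e5, exactly as the ValueError reports. A sketch of a guard that would skip rather than fail; the helper name supports_fp8e4nv is hypothetical and not part of FBGEMM or Triton:

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: fp8e4nv (float8_e4m3fn) needs SM 8.9+
        # (Ada/Hopper). An Ampere part such as an A10G reports (8, 6)
        # and lands in the ValueError path seen above.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

A test class could then apply unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+") and turn this failure into an explicit skip.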
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6145267Z 2025-05-07T20:32:35.6145389Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.6145394Z 2025-05-07T20:32:35.6145496Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6145725Z self=, 2025-05-07T20:32:35.6145802Z T=16384, 2025-05-07T20:32:35.6145880Z D=5120, 2025-05-07T20:32:35.6145961Z scale_ub=None, 2025-05-07T20:32:35.6146047Z contiguous=True, 2025-05-07T20:32:35.6146129Z compiled=False, 2025-05-07T20:32:35.6146200Z ) 2025-05-07T20:32:35.6146473Z self = 2025-05-07T20:32:35.6146654Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6146661Z 2025-05-07T20:32:35.6146741Z @given( 2025-05-07T20:32:35.6146858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6146957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6147076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6147190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6147302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6147377Z ) 2025-05-07T20:32:35.6147625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6147721Z def test_silu_mul_quant( 2025-05-07T20:32:35.6147800Z self, 2025-05-07T20:32:35.6147877Z T: int, 2025-05-07T20:32:35.6147952Z D: int, 2025-05-07T20:32:35.6148056Z scale_ub: Optional[float], 2025-05-07T20:32:35.6148145Z contiguous: bool, 2025-05-07T20:32:35.6148234Z compiled: bool, 2025-05-07T20:32:35.6148310Z ) -> None: 2025-05-07T20:32:35.6148451Z torch.manual_seed(2025) 2025-05-07T20:32:35.6148530Z 2025-05-07T20:32:35.6148697Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6150547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6150629Z 2025-05-07T20:32:35.6150750Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6150755Z 2025-05-07T20:32:35.6150856Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6151087Z self=, 2025-05-07T20:32:35.6151162Z T=4096, 2025-05-07T20:32:35.6151237Z D=5120, 2025-05-07T20:32:35.6151321Z scale_ub=None, 2025-05-07T20:32:35.6151407Z contiguous=True, 2025-05-07T20:32:35.6151493Z compiled=False, 2025-05-07T20:32:35.6151565Z ) 2025-05-07T20:32:35.6151787Z self = 2025-05-07T20:32:35.6151964Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6151968Z 2025-05-07T20:32:35.6152046Z @given( 2025-05-07T20:32:35.6152163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6152264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6152378Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6152495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6152611Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6152686Z ) 2025-05-07T20:32:35.6152940Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6153032Z def test_silu_mul_quant( 2025-05-07T20:32:35.6153110Z self, 2025-05-07T20:32:35.6153190Z T: int, 2025-05-07T20:32:35.6153267Z D: int, 2025-05-07T20:32:35.6153365Z scale_ub: Optional[float], 2025-05-07T20:32:35.6153456Z contiguous: bool, 2025-05-07T20:32:35.6153541Z compiled: bool, 2025-05-07T20:32:35.6153619Z ) -> None: 2025-05-07T20:32:35.6153715Z torch.manual_seed(2025) 2025-05-07T20:32:35.6153789Z 2025-05-07T20:32:35.6153956Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6155857Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6155865Z 2025-05-07T20:32:35.6155988Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6155992Z 2025-05-07T20:32:35.6156091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6156317Z self=, 2025-05-07T20:32:35.6156397Z T=2048, 2025-05-07T20:32:35.6156471Z D=5120, 2025-05-07T20:32:35.6156552Z scale_ub=None, 2025-05-07T20:32:35.6156640Z contiguous=False, 2025-05-07T20:32:35.6156725Z compiled=False, 2025-05-07T20:32:35.6156796Z ) 2025-05-07T20:32:35.6157022Z self = 2025-05-07T20:32:35.6157240Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.6157245Z 2025-05-07T20:32:35.6157325Z @given( 2025-05-07T20:32:35.6157443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6157600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6157716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6157832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6157944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6158021Z ) 2025-05-07T20:32:35.6158313Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6158407Z def test_silu_mul_quant( 2025-05-07T20:32:35.6158486Z self, 2025-05-07T20:32:35.6158562Z T: int, 2025-05-07T20:32:35.6158640Z D: int, 2025-05-07T20:32:35.6158741Z scale_ub: Optional[float], 2025-05-07T20:32:35.6158831Z contiguous: bool, 2025-05-07T20:32:35.6158924Z compiled: bool, 2025-05-07T20:32:35.6159001Z ) -> None: 2025-05-07T20:32:35.6159096Z torch.manual_seed(2025) 2025-05-07T20:32:35.6159173Z 2025-05-07T20:32:35.6159340Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6161314Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
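[NOTE] The OOM failures cascade: once roughly 21.7 GiB of allocations from earlier examples stays live, every subsequent example dies at its first torch.randn. A sketch of a cleanup helper, with the caveat that it is hypothetical and can only reclaim memory whose tensors are no longer referenced (allocations pinned by a captured torch.compile artifact stay put):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop unreachable tensors, then return cached allocator blocks
        # to the driver so the next example starts from a cleaner slate.
        gc.collect()
        torch.cuda.empty_cache()

It would have to be called from a try/finally inside the test body itself, because unittest-style tearDown does not run between individual Hypothesis examples; that is why one leaky example poisons all the examples after it.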
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6161330Z 2025-05-07T20:32:35.6161449Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6161454Z 2025-05-07T20:32:35.6161558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6161822Z self=, 2025-05-07T20:32:35.6161928Z T=4096, 2025-05-07T20:32:35.6162039Z D=7168, 2025-05-07T20:32:35.6162165Z scale_ub=None, 2025-05-07T20:32:35.6162256Z contiguous=True, 2025-05-07T20:32:35.6162345Z compiled=True, 2025-05-07T20:32:35.6162419Z ) 2025-05-07T20:32:35.6162644Z self = 2025-05-07T20:32:35.6162819Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.6162826Z 2025-05-07T20:32:35.6162901Z @given( 2025-05-07T20:32:35.6163018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6163123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6163296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6163414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6163541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6163632Z ) 2025-05-07T20:32:35.6163910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6164006Z def test_silu_mul_quant( 2025-05-07T20:32:35.6164083Z self, 2025-05-07T20:32:35.6164163Z T: int, 2025-05-07T20:32:35.6164239Z D: int, 2025-05-07T20:32:35.6164336Z scale_ub: Optional[float], 2025-05-07T20:32:35.6164429Z contiguous: bool, 2025-05-07T20:32:35.6164516Z compiled: bool, 2025-05-07T20:32:35.6164598Z ) -> None: 2025-05-07T20:32:35.6164695Z torch.manual_seed(2025) 2025-05-07T20:32:35.6164768Z 2025-05-07T20:32:35.6164934Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6166830Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6166872Z 2025-05-07T20:32:35.6166994Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6166999Z 2025-05-07T20:32:35.6167100Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6167368Z self=, 2025-05-07T20:32:35.6167453Z T=2048, 2025-05-07T20:32:35.6167529Z D=5120, 2025-05-07T20:32:35.6167610Z scale_ub=1200.0, 2025-05-07T20:32:35.6167703Z contiguous=False, 2025-05-07T20:32:35.6167786Z compiled=False, 2025-05-07T20:32:35.6167857Z ) 2025-05-07T20:32:35.6168083Z self = 2025-05-07T20:32:35.6168261Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.6168265Z 2025-05-07T20:32:35.6168345Z @given( 2025-05-07T20:32:35.6168464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6168560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6168676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6168792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6168904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6168981Z ) 2025-05-07T20:32:35.6169231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6169324Z def test_silu_mul_quant( 2025-05-07T20:32:35.6169404Z self, 2025-05-07T20:32:35.6169480Z T: int, 2025-05-07T20:32:35.6169556Z D: int, 2025-05-07T20:32:35.6169655Z scale_ub: Optional[float], 2025-05-07T20:32:35.6169745Z contiguous: bool, 2025-05-07T20:32:35.6169833Z compiled: bool, 2025-05-07T20:32:35.6169910Z ) -> None: 2025-05-07T20:32:35.6170002Z torch.manual_seed(2025) 2025-05-07T20:32:35.6170080Z 2025-05-07T20:32:35.6170247Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6172200Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6172214Z 2025-05-07T20:32:35.6172333Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6172338Z 2025-05-07T20:32:35.6172439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6172667Z self=, 2025-05-07T20:32:35.6172745Z T=4096, 2025-05-07T20:32:35.6172821Z D=7168, 2025-05-07T20:32:35.6172908Z scale_ub=1200.0, 2025-05-07T20:32:35.6172989Z contiguous=True, 2025-05-07T20:32:35.6173073Z compiled=False, 2025-05-07T20:32:35.6173151Z ) 2025-05-07T20:32:35.6173398Z self = 2025-05-07T20:32:35.6173602Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6173606Z 2025-05-07T20:32:35.6173682Z @given( 2025-05-07T20:32:35.6173801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6173901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6174016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6174175Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6174291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6174364Z ) 2025-05-07T20:32:35.6174612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6174755Z def test_silu_mul_quant( 2025-05-07T20:32:35.6174832Z self, 2025-05-07T20:32:35.6174910Z T: int, 2025-05-07T20:32:35.6174986Z D: int, 2025-05-07T20:32:35.6175080Z scale_ub: Optional[float], 2025-05-07T20:32:35.6175170Z contiguous: bool, 2025-05-07T20:32:35.6175294Z compiled: bool, 2025-05-07T20:32:35.6175371Z ) -> None: 2025-05-07T20:32:35.6175467Z torch.manual_seed(2025) 2025-05-07T20:32:35.6175538Z 2025-05-07T20:32:35.6175707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6177566Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6177574Z 2025-05-07T20:32:35.6177692Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6177699Z 2025-05-07T20:32:35.6177802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6178027Z self=, 2025-05-07T20:32:35.6178106Z T=16384, 2025-05-07T20:32:35.6178183Z D=7168, 2025-05-07T20:32:35.6178263Z scale_ub=None, 2025-05-07T20:32:35.6178349Z contiguous=False, 2025-05-07T20:32:35.6178430Z compiled=True, 2025-05-07T20:32:35.6178505Z ) 2025-05-07T20:32:35.6178729Z self = 2025-05-07T20:32:35.6178907Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.6178914Z 2025-05-07T20:32:35.6178990Z @given( 2025-05-07T20:32:35.6179109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6179204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6179318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6179432Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6179545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6179620Z ) 2025-05-07T20:32:35.6179870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6180007Z def test_silu_mul_quant( 2025-05-07T20:32:35.6180090Z self, 2025-05-07T20:32:35.6180168Z T: int, 2025-05-07T20:32:35.6180241Z D: int, 2025-05-07T20:32:35.6180343Z scale_ub: Optional[float], 2025-05-07T20:32:35.6180431Z contiguous: bool, 2025-05-07T20:32:35.6180515Z compiled: bool, 2025-05-07T20:32:35.6180596Z ) -> None: 2025-05-07T20:32:35.6180693Z torch.manual_seed(2025) 2025-05-07T20:32:35.6180769Z 2025-05-07T20:32:35.6180937Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6182785Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
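[NOTE] The failing allocation sizes match the input tensor exactly: x has shape [T, 2*D] in bfloat16, i.e. 2 bytes per element, so T=16384 with D=7168 needs 16384 * 14336 * 2 = 469,762,048 bytes = 448.00 MiB, precisely the figure reported above. A quick check against three of the reported failures:

    # bfloat16 input x of shape [T, 2*D]: 2 bytes per element.
    for T, D, reported_mib in [(16384, 7168, 448), (4096, 7168, 112), (2048, 5120, 40)]:
        mib = T * (2 * D) * 2 / 2**20
        assert mib == reported_mib, (T, D, mib)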
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6182839Z 2025-05-07T20:32:35.6182963Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6182967Z 2025-05-07T20:32:35.6183072Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6183368Z self=, 2025-05-07T20:32:35.6183444Z T=4096, 2025-05-07T20:32:35.6183519Z D=7168, 2025-05-07T20:32:35.6183603Z scale_ub=None, 2025-05-07T20:32:35.6183687Z contiguous=True, 2025-05-07T20:32:35.6183771Z compiled=False, 2025-05-07T20:32:35.6183845Z ) 2025-05-07T20:32:35.6184136Z self = 2025-05-07T20:32:35.6184327Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6184332Z 2025-05-07T20:32:35.6184411Z @given( 2025-05-07T20:32:35.6184533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6184636Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6184758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6184882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6185003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6185078Z ) 2025-05-07T20:32:35.6185368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6185462Z def test_silu_mul_quant( 2025-05-07T20:32:35.6185538Z self, 2025-05-07T20:32:35.6185619Z T: int, 2025-05-07T20:32:35.6185694Z D: int, 2025-05-07T20:32:35.6185796Z scale_ub: Optional[float], 2025-05-07T20:32:35.6185890Z contiguous: bool, 2025-05-07T20:32:35.6185976Z compiled: bool, 2025-05-07T20:32:35.6186054Z ) -> None: 2025-05-07T20:32:35.6186152Z torch.manual_seed(2025) 2025-05-07T20:32:35.6186226Z 2025-05-07T20:32:35.6186407Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6188689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6188699Z 2025-05-07T20:32:35.6188822Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6188827Z 2025-05-07T20:32:35.6188935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6189232Z self=, 2025-05-07T20:32:35.6189312Z T=16384, 2025-05-07T20:32:35.6189389Z D=7168, 2025-05-07T20:32:35.6189471Z scale_ub=None, 2025-05-07T20:32:35.6189561Z contiguous=True, 2025-05-07T20:32:35.6189645Z compiled=False, 2025-05-07T20:32:35.6189718Z ) 2025-05-07T20:32:35.6189970Z self = 2025-05-07T20:32:35.6190165Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.6190170Z 2025-05-07T20:32:35.6190247Z @given( 2025-05-07T20:32:35.6190373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6190472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6190597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6190723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6190842Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6190920Z ) 2025-05-07T20:32:35.6191206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6191304Z def test_silu_mul_quant( 2025-05-07T20:32:35.6191383Z self, 2025-05-07T20:32:35.6191527Z T: int, 2025-05-07T20:32:35.6191603Z D: int, 2025-05-07T20:32:35.6191702Z scale_ub: Optional[float], 2025-05-07T20:32:35.6191789Z contiguous: bool, 2025-05-07T20:32:35.6191912Z compiled: bool, 2025-05-07T20:32:35.6191993Z ) -> None: 2025-05-07T20:32:35.6192087Z torch.manual_seed(2025) 2025-05-07T20:32:35.6192163Z 2025-05-07T20:32:35.6192330Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6194234Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6194357Z 2025-05-07T20:32:35.6194473Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6194478Z 2025-05-07T20:32:35.6194581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6194809Z self=, 2025-05-07T20:32:35.6194885Z T=16384, 2025-05-07T20:32:35.6194962Z D=7168, 2025-05-07T20:32:35.6195046Z scale_ub=1200.0, 2025-05-07T20:32:35.6195129Z contiguous=True, 2025-05-07T20:32:35.6195214Z compiled=False, 2025-05-07T20:32:35.6195289Z ) 2025-05-07T20:32:35.6195508Z self = 2025-05-07T20:32:35.6195693Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6195697Z 2025-05-07T20:32:35.6195772Z @given( 2025-05-07T20:32:35.6195890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6195990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6196104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6196220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6196336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6196408Z ) 2025-05-07T20:32:35.6196660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6196754Z def test_silu_mul_quant( 2025-05-07T20:32:35.6196830Z self, 2025-05-07T20:32:35.6196915Z T: int, 2025-05-07T20:32:35.6196989Z D: int, 2025-05-07T20:32:35.6197084Z scale_ub: Optional[float], 2025-05-07T20:32:35.6197176Z contiguous: bool, 2025-05-07T20:32:35.6197260Z compiled: bool, 2025-05-07T20:32:35.6197379Z ) -> None: 2025-05-07T20:32:35.6197475Z torch.manual_seed(2025) 2025-05-07T20:32:35.6197546Z 2025-05-07T20:32:35.6197715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6199566Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
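[NOTE] To pin down one of these cases without replaying the whole search, Hypothesis's example decorator can force a specific parameter combination to run first. A sketch, with the original strategies kept verbatim and max_examples replaced by a literal because _MAX_SAMPLES is defined elsewhere in the test module:

    from typing import Optional

    import hypothesis.strategies as st
    from hypothesis import Verbosity, example, given, settings

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    @settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # unchanged test body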
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6199578Z 2025-05-07T20:32:35.6199692Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6199696Z 2025-05-07T20:32:35.6199799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6200028Z self=, 2025-05-07T20:32:35.6200108Z T=128, 2025-05-07T20:32:35.6200187Z D=5120, 2025-05-07T20:32:35.6200311Z scale_ub=1200.0, 2025-05-07T20:32:35.6200400Z contiguous=False, 2025-05-07T20:32:35.6200483Z compiled=False, 2025-05-07T20:32:35.6200555Z ) 2025-05-07T20:32:35.6200777Z self = 2025-05-07T20:32:35.6200989Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.6200994Z 2025-05-07T20:32:35.6201070Z @given( 2025-05-07T20:32:35.6201189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6201287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6201443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6201557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6201673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6201748Z ) 2025-05-07T20:32:35.6201997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6202091Z def test_silu_mul_quant( 2025-05-07T20:32:35.6202170Z self, 2025-05-07T20:32:35.6202247Z T: int, 2025-05-07T20:32:35.6202323Z D: int, 2025-05-07T20:32:35.6202423Z scale_ub: Optional[float], 2025-05-07T20:32:35.6202514Z contiguous: bool, 2025-05-07T20:32:35.6202597Z compiled: bool, 2025-05-07T20:32:35.6202677Z ) -> None: 2025-05-07T20:32:35.6202769Z torch.manual_seed(2025) 2025-05-07T20:32:35.6202842Z 2025-05-07T20:32:35.6203009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6203085Z 2025-05-07T20:32:35.6203179Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6203303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6203390Z x = x_sign * x_clamp 2025-05-07T20:32:35.6203473Z x0 = x[:, :D] 2025-05-07T20:32:35.6203552Z x1 = x[:, D:] 2025-05-07T20:32:35.6203622Z 2025-05-07T20:32:35.6203710Z if contiguous: 2025-05-07T20:32:35.6203802Z x0 = x0.contiguous() 2025-05-07T20:32:35.6203892Z x1 = x1.contiguous() 2025-05-07T20:32:35.6203965Z 2025-05-07T20:32:35.6204055Z if scale_ub is not None: 2025-05-07T20:32:35.6204169Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6204304Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6204378Z ) 2025-05-07T20:32:35.6204457Z else: 2025-05-07T20:32:35.6204552Z scale_ub_tensor = None 2025-05-07T20:32:35.6204624Z 2025-05-07T20:32:35.6204763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6204852Z op = silu_mul_quant 2025-05-07T20:32:35.6204936Z if compiled: 2025-05-07T20:32:35.6205038Z op = torch.compile(op) 2025-05-07T20:32:35.6205190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6205263Z 2025-05-07T20:32:35.6205355Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6205359Z 2025-05-07T20:32:35.6205459Z moe/activation_test.py:117: 2025-05-07T20:32:35.6205593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6205693Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6205795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6206566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6206668Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6207041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6207277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6207630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6207727Z kernel = self.compile( 2025-05-07T20:32:35.6208216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6208396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6208528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6208590Z 2025-05-07T20:32:35.6208798Z self = 2025-05-07T20:32:35.6209612Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6210199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e07c0>} 2025-05-07T20:32:35.6210975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6211171Z context = 2025-05-07T20:32:35.6211178Z 2025-05-07T20:32:35.6211346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6211622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6211728Z module_map=module_map) 2025-05-07T20:32:35.6211958Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6212062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6212139Z E ^ 2025-05-07T20:32:35.6212511Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6212516Z 2025-05-07T20:32:35.6212996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6213001Z 2025-05-07T20:32:35.6213101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6213333Z self=, 2025-05-07T20:32:35.6213411Z T=2048, 2025-05-07T20:32:35.6213487Z D=7168, 2025-05-07T20:32:35.6213572Z scale_ub=None, 2025-05-07T20:32:35.6213658Z contiguous=False, 2025-05-07T20:32:35.6213745Z compiled=False, 2025-05-07T20:32:35.6213817Z ) 2025-05-07T20:32:35.6214042Z self = 2025-05-07T20:32:35.6214224Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.6214229Z 2025-05-07T20:32:35.6214304Z @given( 2025-05-07T20:32:35.6214491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6214593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6214709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6214823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6214939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6215010Z ) 2025-05-07T20:32:35.6215270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6215363Z def test_silu_mul_quant( 2025-05-07T20:32:35.6215438Z self, 2025-05-07T20:32:35.6215517Z T: int, 2025-05-07T20:32:35.6215592Z D: int, 2025-05-07T20:32:35.6215689Z scale_ub: Optional[float], 2025-05-07T20:32:35.6215783Z contiguous: bool, 2025-05-07T20:32:35.6215868Z compiled: bool, 2025-05-07T20:32:35.6215947Z ) -> None: 2025-05-07T20:32:35.6216042Z torch.manual_seed(2025) 2025-05-07T20:32:35.6216114Z 2025-05-07T20:32:35.6216283Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6218193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
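[NOTE] The retry's fresh pytest session further below prints hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,), so the rerun replays examples deterministically rather than sampling fresh ones. A sketch of how such a profile is typically registered (the conftest.py placement is an assumption; the settings themselves are taken from the log):

    # conftest.py (placement assumed)
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=[HealthCheck.too_slow],
    )
    settings.load_profile("ci")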
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6218237Z 2025-05-07T20:32:35.6218355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6218402Z 2025-05-07T20:32:35.6218505Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6218732Z self=, 2025-05-07T20:32:35.6218815Z T=128, 2025-05-07T20:32:35.6218890Z D=7168, 2025-05-07T20:32:35.6218973Z scale_ub=1200.0, 2025-05-07T20:32:35.6219062Z contiguous=True, 2025-05-07T20:32:35.6219147Z compiled=True, 2025-05-07T20:32:35.6219218Z ) 2025-05-07T20:32:35.6219444Z self = 2025-05-07T20:32:35.6219615Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.6219622Z 2025-05-07T20:32:35.6219696Z @given( 2025-05-07T20:32:35.6219819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6219916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6220032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6220151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6220263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6220340Z ) 2025-05-07T20:32:35.6220593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6220684Z def test_silu_mul_quant( 2025-05-07T20:32:35.6220761Z self, 2025-05-07T20:32:35.6220839Z T: int, 2025-05-07T20:32:35.6220915Z D: int, 2025-05-07T20:32:35.6221016Z scale_ub: Optional[float], 2025-05-07T20:32:35.6221104Z contiguous: bool, 2025-05-07T20:32:35.6221189Z compiled: bool, 2025-05-07T20:32:35.6221268Z ) -> None: 2025-05-07T20:32:35.6221360Z torch.manual_seed(2025) 2025-05-07T20:32:35.6221433Z 2025-05-07T20:32:35.6221601Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6221674Z 2025-05-07T20:32:35.6221768Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6221890Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6221981Z x = x_sign * x_clamp 2025-05-07T20:32:35.6222063Z x0 = x[:, :D] 2025-05-07T20:32:35.6222142Z x1 = x[:, D:] 2025-05-07T20:32:35.6222214Z 2025-05-07T20:32:35.6222370Z if contiguous: 2025-05-07T20:32:35.6222462Z x0 = x0.contiguous() 2025-05-07T20:32:35.6222549Z x1 = x1.contiguous() 2025-05-07T20:32:35.6222625Z 2025-05-07T20:32:35.6222735Z if scale_ub is not None: 2025-05-07T20:32:35.6222853Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6223007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6223084Z ) 2025-05-07T20:32:35.6223166Z else: 2025-05-07T20:32:35.6223262Z scale_ub_tensor = None 2025-05-07T20:32:35.6223333Z 2025-05-07T20:32:35.6223466Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6223555Z op = silu_mul_quant 2025-05-07T20:32:35.6223642Z if compiled: 2025-05-07T20:32:35.6223744Z op = torch.compile(op) 2025-05-07T20:32:35.6223848Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6223919Z 2025-05-07T20:32:35.6224014Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6224018Z 2025-05-07T20:32:35.6224112Z moe/activation_test.py:117: 2025-05-07T20:32:35.6224291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6224393Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6224492Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6224877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.6225009Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.6225519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6225620Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6226026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6226261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6226611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6226707Z kernel = self.compile( 2025-05-07T20:32:35.6227104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6227280Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6227416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6227420Z 2025-05-07T20:32:35.6227628Z self = 2025-05-07T20:32:35.6228431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6228955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173e1940>} 2025-05-07T20:32:35.6229732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6229928Z context = 2025-05-07T20:32:35.6229933Z 2025-05-07T20:32:35.6230101Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6230370Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6230480Z module_map=module_map) 2025-05-07T20:32:35.6230642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6230742Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6230817Z E ^ 2025-05-07T20:32:35.6231224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6231229Z 2025-05-07T20:32:35.6231663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6231667Z 2025-05-07T20:32:35.6231769Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6232000Z self=, 2025-05-07T20:32:35.6232076Z T=128, 2025-05-07T20:32:35.6232150Z D=7168, 2025-05-07T20:32:35.6232236Z scale_ub=1200.0, 2025-05-07T20:32:35.6232319Z contiguous=True, 2025-05-07T20:32:35.6232402Z compiled=False, 2025-05-07T20:32:35.6232478Z ) 2025-05-07T20:32:35.6232701Z self = 2025-05-07T20:32:35.6232876Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6232880Z 2025-05-07T20:32:35.6232960Z @given( 2025-05-07T20:32:35.6233077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6233224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6233363Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6233491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6233619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6233733Z ) 2025-05-07T20:32:35.6233984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6234080Z def test_silu_mul_quant( 2025-05-07T20:32:35.6234158Z self, 2025-05-07T20:32:35.6234234Z T: int, 2025-05-07T20:32:35.6234313Z D: int, 2025-05-07T20:32:35.6234454Z scale_ub: Optional[float], 2025-05-07T20:32:35.6234542Z contiguous: bool, 2025-05-07T20:32:35.6234632Z compiled: bool, 2025-05-07T20:32:35.6234709Z ) -> None: 2025-05-07T20:32:35.6234811Z torch.manual_seed(2025) 2025-05-07T20:32:35.6234884Z 2025-05-07T20:32:35.6235053Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6235137Z 2025-05-07T20:32:35.6235229Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6235357Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6240495Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6240512Z 2025-05-07T20:32:35.6240653Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.6240658Z 2025-05-07T20:32:35.6240763Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6240996Z self=, 2025-05-07T20:32:35.6241072Z T=128, 2025-05-07T20:32:35.6241151Z D=5120, 2025-05-07T20:32:35.6241234Z scale_ub=1200.0, 2025-05-07T20:32:35.6241320Z contiguous=True, 2025-05-07T20:32:35.6241405Z compiled=True, 2025-05-07T20:32:35.6241477Z ) 2025-05-07T20:32:35.6241704Z self = 2025-05-07T20:32:35.6241875Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.6241880Z 2025-05-07T20:32:35.6241957Z @given( 2025-05-07T20:32:35.6242083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6242186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6242303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6242483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6242596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6242671Z ) 2025-05-07T20:32:35.6242926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6243018Z def test_silu_mul_quant( 2025-05-07T20:32:35.6243097Z self, 2025-05-07T20:32:35.6243176Z T: int, 2025-05-07T20:32:35.6243253Z D: int, 2025-05-07T20:32:35.6243356Z scale_ub: Optional[float], 2025-05-07T20:32:35.6243466Z contiguous: bool, 2025-05-07T20:32:35.6243562Z compiled: bool, 2025-05-07T20:32:35.6243662Z ) -> None: 2025-05-07T20:32:35.6243756Z torch.manual_seed(2025) 2025-05-07T20:32:35.6243832Z 2025-05-07T20:32:35.6244005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6244077Z 2025-05-07T20:32:35.6244169Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6244297Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6246186Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6246232Z 2025-05-07T20:32:35.6246351Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.6246393Z 2025-05-07T20:32:35.6246496Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6246726Z self=, 2025-05-07T20:32:35.6246801Z T=128, 2025-05-07T20:32:35.6246879Z D=7168, 2025-05-07T20:32:35.6246962Z scale_ub=None, 2025-05-07T20:32:35.6247046Z contiguous=True, 2025-05-07T20:32:35.6247127Z compiled=True, 2025-05-07T20:32:35.6247201Z ) 2025-05-07T20:32:35.6247426Z self = 2025-05-07T20:32:35.6247597Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.6247604Z 2025-05-07T20:32:35.6247680Z @given( 2025-05-07T20:32:35.6247797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6247895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6248007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6248122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6248241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6248313Z ) 2025-05-07T20:32:35.6248566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6248659Z def test_silu_mul_quant( 2025-05-07T20:32:35.6248734Z self, 2025-05-07T20:32:35.6248813Z T: int, 2025-05-07T20:32:35.6248888Z D: int, 2025-05-07T20:32:35.6248987Z scale_ub: Optional[float], 2025-05-07T20:32:35.6249076Z contiguous: bool, 2025-05-07T20:32:35.6249161Z compiled: bool, 2025-05-07T20:32:35.6249239Z ) -> None: 2025-05-07T20:32:35.6249338Z torch.manual_seed(2025) 2025-05-07T20:32:35.6249410Z 2025-05-07T20:32:35.6249577Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6251459Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6251469Z 2025-05-07T20:32:35.6251588Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6251725Z =============================== warnings summary =============================== 2025-05-07T20:32:35.6252113Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.6252431Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.6252741Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.6253704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:35.6253942Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:35.6253991Z 2025-05-07T20:32:35.6254210Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:35.6254383Z ================= 1 failed, 1 deselected, 3 warnings in 13.14s ================= 2025-05-07T20:32:37.1908322Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:37.2536059Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:37.2536281Z 2025-05-07T20:32:39.2555683Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:41.4013759Z ============================= test session starts ============================== 2025-05-07T20:32:41.4014959Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:41.4015539Z cachedir: .pytest_cache 2025-05-07T20:32:41.4016129Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:41.4016875Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:41.4017292Z plugins: hypothesis-6.131.14 2025-05-07T20:32:43.0459230Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:43.1539889Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:43.1540391Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:43.1540610Z 2025-05-07T20:32:45.5280644Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5281367Z self=, 2025-05-07T20:32:45.5281796Z T=1, 2025-05-07T20:32:45.5281984Z D=5120, 2025-05-07T20:32:45.5282184Z scale_ub=None, 2025-05-07T20:32:45.5282393Z contiguous=True, 2025-05-07T20:32:45.5282619Z compiled=True, 2025-05-07T20:32:45.5282828Z ) 2025-05-07T20:32:45.5283152Z self = 2025-05-07T20:32:45.5283657Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.5283926Z 2025-05-07T20:32:45.5284011Z @given( 2025-05-07T20:32:45.5284244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.5284571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.5284891Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.5285225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.5285852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.5286153Z ) 2025-05-07T20:32:45.5286515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.5286967Z def test_silu_mul_quant( 2025-05-07T20:32:45.5287214Z self, 2025-05-07T20:32:45.5287413Z T: int, 2025-05-07T20:32:45.5287610Z D: int, 2025-05-07T20:32:45.5287836Z scale_ub: Optional[float], 2025-05-07T20:32:45.5288117Z contiguous: bool, 2025-05-07T20:32:45.5288357Z compiled: bool, 2025-05-07T20:32:45.5288593Z ) -> None: 2025-05-07T20:32:45.5288818Z torch.manual_seed(2025) 2025-05-07T20:32:45.5289062Z 2025-05-07T20:32:45.5289340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.5289698Z 2025-05-07T20:32:45.5289889Z x_sign = torch.sign(x) 2025-05-07T20:32:45.5290184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:45.5290503Z x = x_sign * x_clamp 2025-05-07T20:32:45.5290745Z x0 = x[:, :D] 2025-05-07T20:32:45.5290966Z x1 = x[:, D:] 2025-05-07T20:32:45.5291178Z 2025-05-07T20:32:45.5291362Z if contiguous: 2025-05-07T20:32:45.5291687Z x0 = x0.contiguous() 2025-05-07T20:32:45.5292025Z x1 = x1.contiguous() 2025-05-07T20:32:45.5292269Z 2025-05-07T20:32:45.5292458Z if scale_ub is not None: 2025-05-07T20:32:45.5292823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.5293167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.5293479Z ) 2025-05-07T20:32:45.5293681Z else: 2025-05-07T20:32:45.5293900Z scale_ub_tensor = None 2025-05-07T20:32:45.5294151Z 2025-05-07T20:32:45.5294390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5294801Z op = silu_mul_quant 2025-05-07T20:32:45.5295052Z if compiled: 2025-05-07T20:32:45.5295306Z op = torch.compile(op) 2025-05-07T20:32:45.5295611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.5295913Z 2025-05-07T20:32:45.5296135Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.5296429Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.5296728Z 2025-05-07T20:32:45.5296966Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5297310Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.5297613Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.5297929Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.5298299Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5298620Z 2025-05-07T20:32:45.5298819Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.5299027Z 2025-05-07T20:32:45.5299131Z moe/activation_test.py:126: 2025-05-07T20:32:45.5299439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5299780Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.5300118Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5300943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.5301724Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.5302287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.5303000Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.5303713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.5304461Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.5305208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.5305923Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.5306835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.5307362Z fn() 2025-05-07T20:32:45.5307890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.5308489Z self.fn.run( 2025-05-07T20:32:45.5308966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.5309502Z kernel = self.compile( 2025-05-07T20:32:45.5310053Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.5310728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.5311130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5311373Z 2025-05-07T20:32:45.5311585Z self = 2025-05-07T20:32:45.5312779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.5314278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33e99c60>} 2025-05-07T20:32:45.5315671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.5316804Z context = 2025-05-07T20:32:45.5317105Z 2025-05-07T20:32:45.5317277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.5317815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.5318298Z module_map=module_map) 2025-05-07T20:32:45.5318665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.5319029Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.5319303Z E ^ 2025-05-07T20:32:45.5319774Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5320243Z 2025-05-07T20:32:45.5320673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.5321212Z 2025-05-07T20:32:45.5321317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5321745Z self=, 2025-05-07T20:32:45.5322156Z T=2048, 2025-05-07T20:32:45.5322352Z D=5120, 2025-05-07T20:32:45.5322550Z scale_ub=1200.0, 2025-05-07T20:32:45.5322770Z contiguous=True, 2025-05-07T20:32:45.5323000Z compiled=False, 2025-05-07T20:32:45.5323216Z ) 2025-05-07T20:32:46.2646323Z self = 2025-05-07T20:32:46.2647189Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.2647530Z 2025-05-07T20:32:46.2647621Z @given( 2025-05-07T20:32:46.2647855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.2648178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.2648492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.2648825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.2649176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.2649469Z ) 2025-05-07T20:32:46.2650160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.2650629Z def test_silu_mul_quant( 2025-05-07T20:32:46.2650884Z self, 2025-05-07T20:32:46.2651084Z T: int, 2025-05-07T20:32:46.2651300Z D: int, 2025-05-07T20:32:46.2651529Z scale_ub: Optional[float], 2025-05-07T20:32:46.2651904Z contiguous: bool, 2025-05-07T20:32:46.2652147Z compiled: bool, 2025-05-07T20:32:46.2652385Z ) -> None: 2025-05-07T20:32:46.2652615Z torch.manual_seed(2025) 2025-05-07T20:32:46.2652862Z 2025-05-07T20:32:46.2653143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.2653506Z 2025-05-07T20:32:46.2653702Z x_sign = torch.sign(x) 2025-05-07T20:32:46.2654004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.2654332Z x = x_sign * x_clamp 2025-05-07T20:32:46.2654575Z x0 = x[:, :D] 
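# (annotation) x is generated as a [T, 2*D] bfloat16 tensor, sign/clamped so
# |x| lies in [0.01, 2.0], then split into two D-wide halves: x0 drives the
# SiLU gate and x1 is the multiplicand. With contiguous=False the halves stay
# strided views of x, so the kernels are also exercised on non-contiguous
# inputs.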
2025-05-07T20:32:46.2654799Z x1 = x[:, D:] 2025-05-07T20:32:46.2655014Z 2025-05-07T20:32:46.2655208Z if contiguous: 2025-05-07T20:32:46.2655447Z x0 = x0.contiguous() 2025-05-07T20:32:46.2655712Z x1 = x1.contiguous() 2025-05-07T20:32:46.2655953Z 2025-05-07T20:32:46.2656243Z if scale_ub is not None: 2025-05-07T20:32:46.2656525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.2656865Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.2657268Z ) 2025-05-07T20:32:46.2657467Z else: 2025-05-07T20:32:46.2657686Z scale_ub_tensor = None 2025-05-07T20:32:46.2657938Z 2025-05-07T20:32:46.2658176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.2658502Z op = silu_mul_quant 2025-05-07T20:32:46.2658755Z if compiled: 2025-05-07T20:32:46.2659119Z op = torch.compile(op) 2025-05-07T20:32:46.2659425Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.2659703Z 2025-05-07T20:32:46.2659904Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.2660073Z 2025-05-07T20:32:46.2660182Z moe/activation_test.py:117: 2025-05-07T20:32:46.2660484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2660831Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.2661122Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.2661843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.2662559Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.2663117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.2663828Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.2664519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.2665079Z kernel = self.compile( 2025-05-07T20:32:46.2665645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.2666373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.2666798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2667046Z 2025-05-07T20:32:46.2667259Z self = 2025-05-07T20:32:46.2668393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.2669848Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33cf0220>} 2025-05-07T20:32:46.2671299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.2672374Z context = 2025-05-07T20:32:46.2672685Z 2025-05-07T20:32:46.2672856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.2673406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.2673888Z module_map=module_map) 2025-05-07T20:32:46.2674267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.2674635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.2674909Z E ^ 2025-05-07T20:32:46.2675386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.2675862Z 2025-05-07T20:32:46.2676302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.2676889Z 2025-05-07T20:32:46.2677000Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.2677475Z self=, 2025-05-07T20:32:46.2677895Z T=2048, 2025-05-07T20:32:46.2678088Z D=5120, 2025-05-07T20:32:46.2678288Z scale_ub=1200.0, 2025-05-07T20:32:46.2678559Z contiguous=True, 2025-05-07T20:32:46.2678791Z compiled=True, 2025-05-07T20:32:46.2679005Z ) 2025-05-07T20:32:46.2679328Z self = 2025-05-07T20:32:46.2679843Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.2680171Z 2025-05-07T20:32:46.2680256Z @given( 2025-05-07T20:32:46.2680487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.2680809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.2681131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.2681465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.2681806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.2682104Z ) 2025-05-07T20:32:46.2682463Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.2682915Z def test_silu_mul_quant( 2025-05-07T20:32:46.2683168Z self, 2025-05-07T20:32:46.2683373Z T: int, 2025-05-07T20:32:46.2683574Z D: int, 2025-05-07T20:32:46.2683803Z scale_ub: Optional[float], 2025-05-07T20:32:46.2684087Z contiguous: bool, 2025-05-07T20:32:46.2684332Z compiled: bool, 2025-05-07T20:32:46.2684562Z ) -> None: 2025-05-07T20:32:46.2684789Z torch.manual_seed(2025) 2025-05-07T20:32:46.2685037Z 2025-05-07T20:32:46.2685319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.2685674Z 2025-05-07T20:32:46.2685871Z x_sign = torch.sign(x) 2025-05-07T20:32:46.2686200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.2686548Z x = x_sign * x_clamp 2025-05-07T20:32:46.2686800Z x0 = x[:, :D] 2025-05-07T20:32:46.2687019Z x1 = x[:, D:] 2025-05-07T20:32:46.2687233Z 2025-05-07T20:32:46.2687426Z if contiguous: 2025-05-07T20:32:46.2687658Z x0 = x0.contiguous() 2025-05-07T20:32:46.2687924Z x1 = x1.contiguous() 2025-05-07T20:32:46.2688173Z 2025-05-07T20:32:46.2688365Z if scale_ub is not None: 2025-05-07T20:32:46.2688642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.2688985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.2689300Z ) 2025-05-07T20:32:46.2689497Z else: 2025-05-07T20:32:46.2689712Z scale_ub_tensor = None 2025-05-07T20:32:46.2689963Z 2025-05-07T20:32:46.2690200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.2690576Z op = silu_mul_quant 2025-05-07T20:32:46.2690831Z if compiled: 2025-05-07T20:32:46.2691084Z op = torch.compile(op) 2025-05-07T20:32:46.2691389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.2691666Z 2025-05-07T20:32:46.2691917Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.2692213Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.2692515Z 2025-05-07T20:32:46.2692753Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.2693097Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.2693408Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.2693731Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.2694103Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.2694426Z 2025-05-07T20:32:46.2694626Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.2694832Z 2025-05-07T20:32:46.2694940Z moe/activation_test.py:126: 2025-05-07T20:32:46.2695246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2695638Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.2695970Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.2696790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.2697623Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.2698183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.2698900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.2699658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.2700413Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.2701174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.2701845Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.2702476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.2703022Z fn() 2025-05-07T20:32:46.2703543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.2704149Z self.fn.run( 2025-05-07T20:32:46.2704637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.2705188Z kernel = self.compile( 2025-05-07T20:32:46.2705750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.2706729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.2707148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.2707390Z 2025-05-07T20:32:46.2707604Z self = 2025-05-07T20:32:46.2708734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.2710179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33cf16c0>} 2025-05-07T20:32:46.2711692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.2712764Z context = 2025-05-07T20:32:46.2713064Z 2025-05-07T20:32:46.2713240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.2713788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.2714271Z module_map=module_map) 2025-05-07T20:32:46.2714641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.2715009Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.2715284Z E ^ 2025-05-07T20:32:46.2715767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.2716237Z 2025-05-07T20:32:46.2716668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.2717204Z 2025-05-07T20:32:46.2717312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.2717738Z self=, 2025-05-07T20:32:46.2718157Z T=16384, 2025-05-07T20:32:46.2718428Z D=7168, 2025-05-07T20:32:46.2718633Z scale_ub=1200.0, 2025-05-07T20:32:46.2718864Z contiguous=False, 2025-05-07T20:32:46.2719088Z compiled=False, 2025-05-07T20:32:46.2719300Z ) 2025-05-07T20:32:46.9953080Z self = 2025-05-07T20:32:46.9953867Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:46.9960966Z 2025-05-07T20:32:46.9961116Z @given( 2025-05-07T20:32:46.9961374Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.9962002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.9962319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.9962657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.9962990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.9963279Z ) 2025-05-07T20:32:46.9963644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.9964098Z def test_silu_mul_quant( 2025-05-07T20:32:46.9964345Z self, 2025-05-07T20:32:46.9964551Z T: int, 2025-05-07T20:32:46.9964748Z D: int, 2025-05-07T20:32:46.9964978Z scale_ub: Optional[float], 2025-05-07T20:32:46.9965256Z contiguous: bool, 2025-05-07T20:32:46.9965495Z compiled: bool, 2025-05-07T20:32:46.9965736Z ) -> None: 2025-05-07T20:32:46.9965958Z torch.manual_seed(2025) 2025-05-07T20:32:46.9966200Z 2025-05-07T20:32:46.9966482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.9966888Z 2025-05-07T20:32:46.9967084Z x_sign = torch.sign(x) 2025-05-07T20:32:46.9967382Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.9967703Z x = x_sign * x_clamp 2025-05-07T20:32:46.9967950Z x0 = x[:, :D] 2025-05-07T20:32:46.9968166Z x1 = x[:, D:] 2025-05-07T20:32:46.9968382Z 2025-05-07T20:32:46.9968576Z if contiguous: 2025-05-07T20:32:46.9968845Z x0 = x0.contiguous() 2025-05-07T20:32:46.9969101Z x1 = x1.contiguous() 2025-05-07T20:32:46.9969343Z 2025-05-07T20:32:46.9969540Z if scale_ub is not None: 2025-05-07T20:32:46.9969815Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.9970163Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.9970479Z ) 2025-05-07T20:32:46.9970675Z else: 2025-05-07T20:32:46.9970885Z scale_ub_tensor = None 2025-05-07T20:32:46.9971142Z 2025-05-07T20:32:46.9971378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.9971694Z op = silu_mul_quant 2025-05-07T20:32:46.9972029Z if compiled: 2025-05-07T20:32:46.9972374Z op = torch.compile(op) 2025-05-07T20:32:46.9972675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.9972957Z 2025-05-07T20:32:46.9973152Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.9973321Z 2025-05-07T20:32:46.9973423Z moe/activation_test.py:117: 2025-05-07T20:32:46.9973727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.9974077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.9974367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.9975258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
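# (annotation) Every retried example fails the same way: both the fused
# _fbgemm_silu_mul_quant kernel and the _kernel_quantize_fp8_row reference
# path abort in Triton's make_ir because fp8e4nv (e4m3) is not available on
# this GPU; the error lists only 'fp8e4b15' and 'fp8e5'. A guard of roughly
# this shape could skip such examples on unsupported hardware (hypothetical
# snippet; the SM 8.9 cutoff for e4m3 is an assumption about this Triton
# build, not taken from the log):
#
#   major, minor = torch.cuda.get_device_capability()
#   if (major, minor) < (8, 9):
#       self.skipTest("fp8e4nv (e4m3) unsupported on this GPU")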
2025-05-07T20:32:46.9975979Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.9976540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.9977237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.9977929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.9978486Z kernel = self.compile( 2025-05-07T20:32:46.9979132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.9979807Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.9980291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.9980528Z 2025-05-07T20:32:46.9980746Z self = 2025-05-07T20:32:46.9981869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.9983340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32b9cea0>} 2025-05-07T20:32:46.9984737Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.9985798Z context = 2025-05-07T20:32:46.9986097Z 2025-05-07T20:32:46.9986276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.9986810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.9987292Z module_map=module_map) 2025-05-07T20:32:46.9987669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.9988035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.9988298Z E ^ 2025-05-07T20:32:46.9988779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.9989247Z 2025-05-07T20:32:46.9989687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.9990217Z 2025-05-07T20:32:46.9990327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.9990752Z self=, 2025-05-07T20:32:46.9991168Z T=1, 2025-05-07T20:32:46.9991355Z D=7168, 2025-05-07T20:32:46.9991547Z scale_ub=None, 2025-05-07T20:32:46.9991766Z contiguous=True, 2025-05-07T20:32:46.9991994Z compiled=True, 2025-05-07T20:32:46.9992197Z ) 2025-05-07T20:32:46.9992529Z self = 2025-05-07T20:32:46.9993028Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.9993293Z 2025-05-07T20:32:46.9993371Z @given( 2025-05-07T20:32:46.9993649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.9993975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.9994293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.9994630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.9994969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.9995264Z ) 2025-05-07T20:32:46.9995621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.9996079Z def test_silu_mul_quant( 2025-05-07T20:32:46.9996327Z self, 2025-05-07T20:32:46.9996524Z T: int, 2025-05-07T20:32:46.9996760Z D: int, 2025-05-07T20:32:46.9997005Z scale_ub: Optional[float], 2025-05-07T20:32:46.9997279Z contiguous: bool, 2025-05-07T20:32:46.9997528Z compiled: bool, 2025-05-07T20:32:46.9997760Z ) -> None: 2025-05-07T20:32:46.9997975Z torch.manual_seed(2025) 2025-05-07T20:32:46.9998228Z 2025-05-07T20:32:46.9998507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.9998850Z 2025-05-07T20:32:46.9999089Z x_sign = torch.sign(x) 2025-05-07T20:32:46.9999390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.9999708Z x = x_sign * x_clamp 2025-05-07T20:32:46.9999950Z x0 = x[:, :D] 2025-05-07T20:32:47.0000255Z x1 = x[:, D:] 2025-05-07T20:32:47.0000473Z 2025-05-07T20:32:47.0000659Z if contiguous: 2025-05-07T20:32:47.0000896Z x0 = x0.contiguous() 2025-05-07T20:32:47.0001162Z x1 = x1.contiguous() 2025-05-07T20:32:47.0001403Z 2025-05-07T20:32:47.0001602Z if scale_ub is not None: 2025-05-07T20:32:47.0001926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.0002266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.0002585Z ) 2025-05-07T20:32:47.0002787Z else: 2025-05-07T20:32:47.0003003Z scale_ub_tensor = None 2025-05-07T20:32:47.0003262Z 2025-05-07T20:32:47.0003501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.0003818Z op = silu_mul_quant 2025-05-07T20:32:47.0004079Z if compiled: 2025-05-07T20:32:47.0004335Z op = torch.compile(op) 2025-05-07T20:32:47.0004635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.0004915Z 2025-05-07T20:32:47.0005111Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.0005405Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.0005697Z 2025-05-07T20:32:47.0005941Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.0006607Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.0006932Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.0007282Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.0007689Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.0008037Z 2025-05-07T20:32:47.0008253Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.0008474Z 2025-05-07T20:32:47.0008590Z moe/activation_test.py:126: 2025-05-07T20:32:47.0008920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.0009296Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.0009667Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.0010626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.0011537Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.0012197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.0012904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.0013697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.0014446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.0015201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.0015869Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.0016495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.0017025Z fn() 2025-05-07T20:32:47.0017546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.0018151Z self.fn.run( 2025-05-07T20:32:47.0018626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.0019178Z kernel = self.compile( 2025-05-07T20:32:47.0019740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.0020481Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.0020893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.0021136Z 2025-05-07T20:32:47.0021421Z self = 2025-05-07T20:32:47.0022546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.0024030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bc6f20>} 2025-05-07T20:32:47.0025425Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.0026491Z context = 2025-05-07T20:32:47.0026794Z 2025-05-07T20:32:47.0026964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.0027507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.0027985Z module_map=module_map) 2025-05-07T20:32:47.0028361Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.0028727Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.0029000Z E ^ 2025-05-07T20:32:47.0029483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.0029957Z 2025-05-07T20:32:47.0030389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.0030921Z 2025-05-07T20:32:47.0031034Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.0031454Z self=, 2025-05-07T20:32:47.0031870Z T=4096, 2025-05-07T20:32:47.0032061Z D=5120, 2025-05-07T20:32:47.0032252Z scale_ub=None, 2025-05-07T20:32:47.0032474Z contiguous=False, 2025-05-07T20:32:47.0032709Z compiled=False, 2025-05-07T20:32:47.0032908Z ) 2025-05-07T20:32:47.7960026Z self = 2025-05-07T20:32:47.7960776Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.7961191Z 2025-05-07T20:32:47.7961318Z @given( 2025-05-07T20:32:47.7961584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7961902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7962497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7962831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7963192Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7963479Z ) 2025-05-07T20:32:47.7963838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7964299Z def test_silu_mul_quant( 2025-05-07T20:32:47.7964542Z self, 2025-05-07T20:32:47.7964742Z T: int, 2025-05-07T20:32:47.7964947Z D: int, 2025-05-07T20:32:47.7965168Z scale_ub: Optional[float], 2025-05-07T20:32:47.7965446Z contiguous: bool, 2025-05-07T20:32:47.7965692Z compiled: bool, 2025-05-07T20:32:47.7965924Z ) -> None: 2025-05-07T20:32:47.7966145Z torch.manual_seed(2025) 2025-05-07T20:32:47.7966398Z 2025-05-07T20:32:47.7966671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7967034Z 2025-05-07T20:32:47.7967271Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7967569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7967981Z x = x_sign * x_clamp 2025-05-07T20:32:47.7968230Z x0 = x[:, :D] 2025-05-07T20:32:47.7968450Z x1 = x[:, D:] 2025-05-07T20:32:47.7968654Z 2025-05-07T20:32:47.7968846Z if contiguous: 2025-05-07T20:32:47.7969168Z x0 = x0.contiguous() 2025-05-07T20:32:47.7969426Z x1 = x1.contiguous() 2025-05-07T20:32:47.7969668Z 2025-05-07T20:32:47.7969862Z if scale_ub is not None: 2025-05-07T20:32:47.7970135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7970477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7970877Z ) 2025-05-07T20:32:47.7971069Z else: 2025-05-07T20:32:47.7971284Z scale_ub_tensor = None 2025-05-07T20:32:47.7971540Z 2025-05-07T20:32:47.7971776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7972200Z op = silu_mul_quant 2025-05-07T20:32:47.7972456Z if compiled: 2025-05-07T20:32:47.7972709Z op = torch.compile(op) 2025-05-07T20:32:47.7973010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7973293Z 2025-05-07T20:32:47.7973492Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.7973659Z 2025-05-07T20:32:47.7973764Z moe/activation_test.py:117: 2025-05-07T20:32:47.7974098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7974440Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.7974728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7975634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.7976356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.7976914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7977619Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7978305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7978857Z kernel = self.compile( 2025-05-07T20:32:47.7979416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7980089Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7980501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7980744Z 2025-05-07T20:32:47.7980959Z self = 2025-05-07T20:32:47.7982142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7983588Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bc7ec0>} 2025-05-07T20:32:47.7984986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7986052Z context = 2025-05-07T20:32:47.7986348Z 2025-05-07T20:32:47.7986525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.7987120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.7987594Z module_map=module_map) 2025-05-07T20:32:47.7987972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.7988335Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.7988596Z E ^ 2025-05-07T20:32:47.7989119Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.7989584Z 2025-05-07T20:32:47.7990022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.7990594Z 2025-05-07T20:32:47.7990708Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.7991127Z self=, 2025-05-07T20:32:47.7991544Z T=4096, 2025-05-07T20:32:47.7991736Z D=7168, 2025-05-07T20:32:47.7991971Z scale_ub=None, 2025-05-07T20:32:47.7992192Z contiguous=False, 2025-05-07T20:32:47.7992424Z compiled=False, 2025-05-07T20:32:47.7992628Z ) 2025-05-07T20:32:47.7992964Z self = 2025-05-07T20:32:47.7993476Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.7993757Z 2025-05-07T20:32:47.7993842Z @given( 2025-05-07T20:32:47.7994074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7994395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7994709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7995041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7995377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7995671Z ) 2025-05-07T20:32:47.7996023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7996475Z def test_silu_mul_quant( 2025-05-07T20:32:47.7996728Z self, 2025-05-07T20:32:47.7996920Z T: int, 2025-05-07T20:32:47.7997121Z D: int, 2025-05-07T20:32:47.7997341Z scale_ub: Optional[float], 2025-05-07T20:32:47.7997615Z contiguous: bool, 2025-05-07T20:32:47.7997860Z compiled: bool, 2025-05-07T20:32:47.7998090Z ) -> None: 2025-05-07T20:32:47.7998310Z torch.manual_seed(2025) 2025-05-07T20:32:47.7998555Z 2025-05-07T20:32:47.7998832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7999185Z 2025-05-07T20:32:47.7999378Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7999678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7999996Z x = x_sign * x_clamp 2025-05-07T20:32:47.8000238Z x0 = x[:, :D] 2025-05-07T20:32:47.8000461Z x1 = x[:, D:] 2025-05-07T20:32:47.8000672Z 2025-05-07T20:32:47.8000856Z if contiguous: 2025-05-07T20:32:47.8001095Z x0 = x0.contiguous() 2025-05-07T20:32:47.8001362Z x1 = x1.contiguous() 2025-05-07T20:32:47.8001601Z 2025-05-07T20:32:47.8001799Z if scale_ub is not None: 2025-05-07T20:32:47.8002164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.8002501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.8002815Z ) 2025-05-07T20:32:47.8003013Z else: 2025-05-07T20:32:47.8003231Z scale_ub_tensor = None 2025-05-07T20:32:47.8003481Z 2025-05-07T20:32:47.8003720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.8004046Z op = silu_mul_quant 2025-05-07T20:32:47.8004295Z if compiled: 2025-05-07T20:32:47.8004545Z op = torch.compile(op) 2025-05-07T20:32:47.8004846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.8005119Z 2025-05-07T20:32:47.8005313Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.8005478Z 2025-05-07T20:32:47.8005588Z moe/activation_test.py:117: 2025-05-07T20:32:47.8005885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8006501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.8006792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.8007608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.8008320Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.8008874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.8009645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.8010326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.8010878Z kernel = self.compile( 2025-05-07T20:32:47.8011438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.8012244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.8012653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8012895Z 2025-05-07T20:32:47.8013109Z self = 2025-05-07T20:32:47.8014235Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.8015665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bad620>} 2025-05-07T20:32:47.8017114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.8018189Z context = 2025-05-07T20:32:47.8018493Z 2025-05-07T20:32:47.8018666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.8019206Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.8019684Z module_map=module_map) 2025-05-07T20:32:47.8020058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.8020423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.8020689Z E ^ 2025-05-07T20:32:47.8021160Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.8021632Z 2025-05-07T20:32:47.8022061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.8022594Z 2025-05-07T20:32:47.8022706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.8023127Z self=, 2025-05-07T20:32:47.8023616Z T=128, 2025-05-07T20:32:47.8023813Z D=7168, 2025-05-07T20:32:47.8024014Z scale_ub=None, 2025-05-07T20:32:47.8024233Z contiguous=False, 2025-05-07T20:32:47.8024471Z compiled=True, 2025-05-07T20:32:47.8024676Z ) 2025-05-07T20:32:47.8583899Z self = 2025-05-07T20:32:47.8584638Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.8585033Z 2025-05-07T20:32:47.8585172Z @given( 2025-05-07T20:32:47.8585488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.8585840Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.8586158Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.8586552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.8586888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.8587171Z ) 2025-05-07T20:32:47.8587530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.8587983Z def test_silu_mul_quant( 2025-05-07T20:32:47.8588223Z self, 2025-05-07T20:32:47.8588651Z T: int, 2025-05-07T20:32:47.8588856Z D: int, 2025-05-07T20:32:47.8589071Z scale_ub: Optional[float], 2025-05-07T20:32:47.8589346Z contiguous: bool, 2025-05-07T20:32:47.8589595Z compiled: bool, 2025-05-07T20:32:47.8589889Z ) -> None: 2025-05-07T20:32:47.8590107Z torch.manual_seed(2025) 2025-05-07T20:32:47.8590354Z 2025-05-07T20:32:47.8590629Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.8590977Z 2025-05-07T20:32:47.8591177Z x_sign = torch.sign(x) 2025-05-07T20:32:47.8591545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.8591867Z x = x_sign * x_clamp 2025-05-07T20:32:47.8592110Z x0 = x[:, :D] 2025-05-07T20:32:47.8592329Z x1 = x[:, D:] 2025-05-07T20:32:47.8592534Z 2025-05-07T20:32:47.8592729Z if contiguous: 2025-05-07T20:32:47.8592965Z x0 = x0.contiguous() 2025-05-07T20:32:47.8593222Z x1 = x1.contiguous() 2025-05-07T20:32:47.8593467Z 2025-05-07T20:32:47.8593659Z if scale_ub is not None: 2025-05-07T20:32:47.8593928Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.8594283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.8601220Z ) 2025-05-07T20:32:47.8601436Z else: 2025-05-07T20:32:47.8601653Z scale_ub_tensor = None 2025-05-07T20:32:47.8601918Z 2025-05-07T20:32:47.8602158Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.8602487Z op = silu_mul_quant 2025-05-07T20:32:47.8602750Z if compiled: 2025-05-07T20:32:47.8603000Z op = torch.compile(op) 2025-05-07T20:32:47.8603307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.8603591Z 2025-05-07T20:32:47.8603793Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.8604081Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.8604382Z 2025-05-07T20:32:47.8604628Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.8604967Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.8605272Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.8605597Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.8605960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.8606556Z 2025-05-07T20:32:47.8606765Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.8606967Z 2025-05-07T20:32:47.8607076Z moe/activation_test.py:126: 2025-05-07T20:32:47.8607379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8607724Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.8608061Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.8609001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.8609789Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.8610350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.8611054Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.8611758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.8612580Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.8613341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.8613992Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.8614617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.8615149Z fn() 2025-05-07T20:32:47.8615746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.8616378Z self.fn.run( 2025-05-07T20:32:47.8616875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.8617489Z kernel = self.compile( 2025-05-07T20:32:47.8618039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.8618713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.8619189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.8619426Z 2025-05-07T20:32:47.8619646Z self = 2025-05-07T20:32:47.8620770Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.8622206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb32bacd60>} 2025-05-07T20:32:47.8623602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.8624666Z context = 2025-05-07T20:32:47.8624965Z 2025-05-07T20:32:47.8625141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.8625676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.8626174Z module_map=module_map) 2025-05-07T20:32:47.8626587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.8626947Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.8627220Z E ^ 2025-05-07T20:32:47.8627697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.8628163Z 2025-05-07T20:32:47.8628600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.8629131Z 2025-05-07T20:32:47.8629237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.8629670Z self=, 2025-05-07T20:32:47.8630086Z T=128, 2025-05-07T20:32:47.8630273Z D=7168, 2025-05-07T20:32:47.8630472Z scale_ub=None, 2025-05-07T20:32:47.8630745Z contiguous=False, 2025-05-07T20:32:47.8630973Z compiled=False, 2025-05-07T20:32:47.8631192Z ) 2025-05-07T20:32:48.0597567Z self = 2025-05-07T20:32:48.0598325Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:48.0598711Z 2025-05-07T20:32:48.0598827Z @given( 2025-05-07T20:32:48.0599127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.0599538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.0599897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.0600231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.0600558Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.0600855Z ) 2025-05-07T20:32:48.0601208Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.0601659Z def test_silu_mul_quant( 2025-05-07T20:32:48.0601911Z self, 2025-05-07T20:32:48.0602118Z T: int, 2025-05-07T20:32:48.0602316Z D: int, 2025-05-07T20:32:48.0602541Z scale_ub: Optional[float], 2025-05-07T20:32:48.0603088Z contiguous: bool, 2025-05-07T20:32:48.0603335Z compiled: bool, 2025-05-07T20:32:48.0603573Z ) -> None: 2025-05-07T20:32:48.0603794Z torch.manual_seed(2025) 2025-05-07T20:32:48.0604044Z 2025-05-07T20:32:48.0604414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.0604768Z 2025-05-07T20:32:48.0604968Z x_sign = torch.sign(x) 2025-05-07T20:32:48.0605262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.0605586Z x = x_sign * x_clamp 2025-05-07T20:32:48.0605934Z x0 = x[:, :D] 2025-05-07T20:32:48.0606459Z x1 = x[:, D:] 2025-05-07T20:32:48.0606718Z 2025-05-07T20:32:48.0606912Z if contiguous: 2025-05-07T20:32:48.0607147Z x0 = x0.contiguous() 2025-05-07T20:32:48.0607424Z x1 = x1.contiguous() 2025-05-07T20:32:48.0607673Z 2025-05-07T20:32:48.0607865Z if scale_ub is not None: 2025-05-07T20:32:48.0608153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.0608501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.0608815Z ) 2025-05-07T20:32:48.0609018Z else: 2025-05-07T20:32:48.0609241Z scale_ub_tensor = None 2025-05-07T20:32:48.0609498Z 2025-05-07T20:32:48.0609738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0610064Z op = silu_mul_quant 2025-05-07T20:32:48.0610323Z if compiled: 2025-05-07T20:32:48.0610571Z op = torch.compile(op) 2025-05-07T20:32:48.0610879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0611165Z 2025-05-07T20:32:48.0611357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.0611536Z 2025-05-07T20:32:48.0611639Z moe/activation_test.py:117: 2025-05-07T20:32:48.0612037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0612375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.0612666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0613384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.0614104Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.0614651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.0615357Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.0616051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.0616600Z kernel = self.compile( 2025-05-07T20:32:48.0617261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.0617948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.0618365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0618602Z 2025-05-07T20:32:48.0618815Z self = 2025-05-07T20:32:48.0619937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.0621386Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd92340>} 2025-05-07T20:32:48.0622799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.0623890Z context = 2025-05-07T20:32:48.0624192Z 2025-05-07T20:32:48.0624432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.0624985Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.0625538Z module_map=module_map) 2025-05-07T20:32:48.0625912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.0626281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.0626553Z E ^ 2025-05-07T20:32:48.0627047Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.0627631Z 2025-05-07T20:32:48.0628075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.0628632Z 2025-05-07T20:32:48.0628739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.0629176Z self=, 2025-05-07T20:32:48.0629597Z T=4096, 2025-05-07T20:32:48.0629795Z D=5120, 2025-05-07T20:32:48.0629998Z scale_ub=1200.0, 2025-05-07T20:32:48.0630228Z contiguous=True, 2025-05-07T20:32:48.0630454Z compiled=False, 2025-05-07T20:32:48.0630677Z ) 2025-05-07T20:32:48.0631020Z self = 2025-05-07T20:32:48.0631541Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:48.0631836Z 2025-05-07T20:32:48.0631917Z @given( 2025-05-07T20:32:48.0632155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.0632479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.0632799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.0633147Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.0633482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.0633781Z ) 2025-05-07T20:32:48.0634152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.0634619Z def test_silu_mul_quant( 2025-05-07T20:32:48.0634867Z self, 2025-05-07T20:32:48.0635070Z T: int, 2025-05-07T20:32:48.0635281Z D: int, 2025-05-07T20:32:48.0635502Z scale_ub: Optional[float], 2025-05-07T20:32:48.0635787Z contiguous: bool, 2025-05-07T20:32:48.0636038Z compiled: bool, 2025-05-07T20:32:48.0636262Z ) -> None: 2025-05-07T20:32:48.0636494Z torch.manual_seed(2025) 2025-05-07T20:32:48.0636777Z 2025-05-07T20:32:48.0637054Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.0637411Z 2025-05-07T20:32:48.0637609Z x_sign = torch.sign(x) 2025-05-07T20:32:48.0637903Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.0638279Z x = x_sign * x_clamp 2025-05-07T20:32:48.0638527Z x0 = x[:, :D] 2025-05-07T20:32:48.0638745Z x1 = x[:, D:] 2025-05-07T20:32:48.0638962Z 2025-05-07T20:32:48.0639155Z if contiguous: 2025-05-07T20:32:48.0639395Z x0 = x0.contiguous() 2025-05-07T20:32:48.0639656Z x1 = x1.contiguous() 2025-05-07T20:32:48.0639908Z 2025-05-07T20:32:48.0640106Z if scale_ub is not None: 2025-05-07T20:32:48.0640382Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.0640729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.0641050Z ) 2025-05-07T20:32:48.0641243Z else: 2025-05-07T20:32:48.0641457Z scale_ub_tensor = None 2025-05-07T20:32:48.0641738Z 2025-05-07T20:32:48.0641980Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0642302Z op = silu_mul_quant 2025-05-07T20:32:48.0642567Z if compiled: 2025-05-07T20:32:48.0642826Z op = torch.compile(op) 2025-05-07T20:32:48.0643132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0643411Z 2025-05-07T20:32:48.0643666Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.0643838Z 2025-05-07T20:32:48.0643950Z moe/activation_test.py:117: 2025-05-07T20:32:48.0644253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0644639Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.0644931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0645639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.0646361Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.0647000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.0647728Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.0648415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.0648974Z kernel = self.compile( 2025-05-07T20:32:48.0649537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.0650207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.0650625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0650873Z 2025-05-07T20:32:48.0651087Z self = 2025-05-07T20:32:48.0652261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.0653693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd91c60>} 2025-05-07T20:32:48.0655086Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.0656152Z context = 2025-05-07T20:32:48.0656457Z 2025-05-07T20:32:48.0656639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.0657219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.0657698Z module_map=module_map) 2025-05-07T20:32:48.0658075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.0658439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.0658703Z E ^ 2025-05-07T20:32:48.0659237Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.0659712Z 2025-05-07T20:32:48.0660151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.0660682Z 2025-05-07T20:32:48.0660794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.0661218Z self=, 2025-05-07T20:32:48.0661636Z T=1, 2025-05-07T20:32:48.0661823Z D=5120, 2025-05-07T20:32:48.0662016Z scale_ub=None, 2025-05-07T20:32:48.0662244Z contiguous=True, 2025-05-07T20:32:48.0662473Z compiled=True, 2025-05-07T20:32:48.0662676Z ) 2025-05-07T20:32:48.4399016Z self = 2025-05-07T20:32:48.4399751Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.4400076Z 2025-05-07T20:32:48.4400184Z @given( 2025-05-07T20:32:48.4400423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.4401010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.4401338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.4401673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.4401997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.4402380Z ) 2025-05-07T20:32:48.4402743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.4403201Z def test_silu_mul_quant( 2025-05-07T20:32:48.4403443Z self, 2025-05-07T20:32:48.4403649Z T: int, 2025-05-07T20:32:48.4403851Z D: int, 2025-05-07T20:32:48.4404157Z scale_ub: Optional[float], 2025-05-07T20:32:48.4404432Z contiguous: bool, 2025-05-07T20:32:48.4404678Z compiled: bool, 2025-05-07T20:32:48.4404908Z ) -> None: 2025-05-07T20:32:48.4405136Z torch.manual_seed(2025) 2025-05-07T20:32:48.4405384Z 2025-05-07T20:32:48.4405653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.4406004Z 2025-05-07T20:32:48.4406559Z x_sign = torch.sign(x) 2025-05-07T20:32:48.4406858Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.4407174Z x = x_sign * x_clamp 2025-05-07T20:32:48.4407461Z x0 = x[:, :D] 2025-05-07T20:32:48.4407681Z x1 = x[:, D:] 2025-05-07T20:32:48.4407896Z 2025-05-07T20:32:48.4408075Z if contiguous: 2025-05-07T20:32:48.4408296Z x0 = x0.contiguous() 2025-05-07T20:32:48.4408556Z x1 = x1.contiguous() 2025-05-07T20:32:48.4408791Z 2025-05-07T20:32:48.4408975Z if scale_ub is not None: 2025-05-07T20:32:48.4409247Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.4409583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.4409897Z ) 2025-05-07T20:32:48.4410088Z else: 2025-05-07T20:32:48.4410301Z scale_ub_tensor = None 2025-05-07T20:32:48.4410554Z 2025-05-07T20:32:48.4410789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.4411108Z op = silu_mul_quant 2025-05-07T20:32:48.4411362Z if compiled: 2025-05-07T20:32:48.4411605Z op = torch.compile(op) 2025-05-07T20:32:48.4412021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.4412303Z 2025-05-07T20:32:48.4412489Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.4412778Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.4413074Z 2025-05-07T20:32:48.4413307Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.4413649Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.4413945Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.4414269Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.4414743Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.4415068Z 2025-05-07T20:32:48.4415277Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.4415479Z 2025-05-07T20:32:48.4415584Z moe/activation_test.py:126: 2025-05-07T20:32:48.4415890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4416233Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.4416568Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.4417384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.4418163Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.4418731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.4419433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.4420144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.4420958Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.4421716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.4422432Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.4423056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.4423593Z fn() 2025-05-07T20:32:48.4424110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.4424783Z self.fn.run( 2025-05-07T20:32:48.4425262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.4425809Z kernel = self.compile( 2025-05-07T20:32:48.4426361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.4427075Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.4427494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4427733Z 2025-05-07T20:32:48.4427950Z self = 2025-05-07T20:32:48.4429064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.4430514Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd932e0>} 2025-05-07T20:32:48.4431917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.4432985Z context = 2025-05-07T20:32:48.4433281Z 2025-05-07T20:32:48.4433456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.4433993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.4434474Z module_map=module_map) 2025-05-07T20:32:48.4434848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.4435210Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.4435487Z E ^ 2025-05-07T20:32:48.4435964Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Fails identically to the T=1 example above: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row raises triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
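All of these failures share one root cause: Triton lowers the fp8e4nv (FP8 E4M3) dtype only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the A10G on a linux.g5.4xlarge runner reports SM 8.6, so both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row die in make_ir before any kernel launch. A guard along the following lines (hypothetical, not part of activation_test.py) would skip the FP8 cases on such runners:

```python
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # get_device_capability() returns (major, minor), e.g. (8, 6) for the
    # A10G; Triton's fp8e4nv (E4M3) lowering needs (8, 9) or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class Fp8ActivationTests(unittest.TestCase):
    """FP8 MoE activation tests would be collected here instead of failing."""
```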
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Same CompilationError from _kernel_quantize_fp8_row via ref_fn().

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
Same CompilationError from _kernel_quantize_fp8_row via ref_fn().

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:49.683000 98222 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
The T=16384 example then fails with the same CompilationError from _kernel_quantize_fp8_row via ref_fn().

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Fails in fn() instead: with compiled=True the call enters through torch/_dynamo/eval_frame.py:678 (_fn) and reaches fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant), where launching _fbgemm_silu_mul_quant[grid] raises the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
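Separately from the compile error, the recompile_limit warning above has a simple cause: each Hypothesis example changes T and, via the contiguous flag, the strides of x0, so torch.compile specializes silu_mul_quant again for every new shape/stride combination until the cap of 8 is hit and dynamo stops recompiling. Two illustrative mitigations (a sketch, not what the test does; the config knob is the one named in the warning itself, and the import path is taken from the tracebacks):

```python
import torch
import torch._dynamo

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Option 1: compile with dynamic shapes so T is treated symbolically and a
# new batch size or stride pattern does not force a fresh specialization.
op = torch.compile(silu_mul_quant, dynamic=True)

# Option 2: raise the cap named in the warning (default 8).
torch._dynamo.config.recompile_limit = 32
```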
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
Same CompilationError from _kernel_quantize_fp8_row via ref_fn().
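The ref_fn tracebacks all pass through triton/runtime/autotuner.py (_bench / do_bench) because _kernel_quantize_fp8_row is an autotuned Triton kernel: its first launch benchmarks every candidate config, and compilation, which is where this ValueError surfaces, happens inside that benchmarking loop. A toy illustration of the decorator pattern (not the fbgemm kernel):

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=2),
        triton.Config({"BLOCK": 256}, num_warps=4),
    ],
    key=["N"],  # re-benchmark the configs whenever N changes
)
@triton.jit
def _double(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    # Each config above is compiled and timed on first launch; a dtype the
    # target GPU cannot lower fails right here, before any timing happens.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)
```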
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
Fails in fn(): the eager path reaches fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) directly, and the _fbgemm_silu_mul_quant[grid] launch raises the same CompilationError.
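The repeated "Trying example:" lines are Hypothesis verbose output: the @given strategies shown in every listing enumerate a small parameter grid, and @settings(verbosity=Verbosity.verbose, ...) prints each drawn combination. A self-contained sketch of the same harness pattern (toy property; _MAX_SAMPLES is the suite's own constant and its value is not visible in this log):

```python
from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st


@given(
    t=st.sampled_from([1, 128, 2048]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
def test_toy(t: int, compiled: bool) -> None:
    # Verbosity.verbose prints a "Trying example: test_toy(...)" line for
    # each drawn combination, exactly like the log above.
    assert t >= 1
```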
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2992541Z 2025-05-07T20:32:50.2993013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2993543Z 2025-05-07T20:32:50.2993653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2994072Z self=, 2025-05-07T20:32:50.2994530Z T=128, 2025-05-07T20:32:50.2994721Z D=5120, 2025-05-07T20:32:50.2994909Z scale_ub=None, 2025-05-07T20:32:50.2995126Z contiguous=False, 2025-05-07T20:32:50.2995353Z compiled=True, 2025-05-07T20:32:50.2995556Z ) 2025-05-07T20:32:50.2995884Z self = 2025-05-07T20:32:50.2996436Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.2996710Z 2025-05-07T20:32:50.2996793Z @given( 2025-05-07T20:32:50.2997022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2997344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2997683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2998036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2998372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2998664Z ) 2025-05-07T20:32:50.2999017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2999470Z def test_silu_mul_quant( 2025-05-07T20:32:50.2999718Z self, 2025-05-07T20:32:50.2999906Z T: int, 2025-05-07T20:32:50.3000107Z D: int, 2025-05-07T20:32:50.3000328Z scale_ub: Optional[float], 2025-05-07T20:32:50.3000605Z contiguous: bool, 2025-05-07T20:32:50.3000840Z compiled: bool, 2025-05-07T20:32:50.3001063Z ) -> None: 2025-05-07T20:32:50.3001278Z torch.manual_seed(2025) 2025-05-07T20:32:50.3001516Z 2025-05-07T20:32:50.3001795Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.3002143Z 2025-05-07T20:32:50.3002334Z x_sign = torch.sign(x) 2025-05-07T20:32:50.3002632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.3002945Z x = x_sign * x_clamp 2025-05-07T20:32:50.3003180Z x0 = x[:, :D] 2025-05-07T20:32:50.3003397Z x1 = x[:, D:] 2025-05-07T20:32:50.3003613Z 2025-05-07T20:32:50.3003796Z if contiguous: 2025-05-07T20:32:50.3004030Z x0 = x0.contiguous() 2025-05-07T20:32:50.3004294Z x1 = x1.contiguous() 2025-05-07T20:32:50.3004533Z 2025-05-07T20:32:50.3004726Z if scale_ub is not None: 2025-05-07T20:32:50.3005002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.3005339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.3005655Z ) 2025-05-07T20:32:50.3005848Z else: 2025-05-07T20:32:50.3006111Z scale_ub_tensor = None 2025-05-07T20:32:50.3006719Z 2025-05-07T20:32:50.3006956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.3007281Z op = silu_mul_quant 2025-05-07T20:32:50.3007530Z if compiled: 2025-05-07T20:32:50.3007812Z op = torch.compile(op) 2025-05-07T20:32:50.3008135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3008411Z 2025-05-07T20:32:50.3008604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.3008769Z 2025-05-07T20:32:50.3008873Z moe/activation_test.py:117: 2025-05-07T20:32:50.3009166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3009503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.3009792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3010362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.3010928Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.3011607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:50.3012497Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:50.3013045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.3013743Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.3014484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.3015033Z     kernel = self.compile(
2025-05-07T20:32:50.3015582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.3016320Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.3016735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:50.3016970Z 
2025-05-07T20:32:50.3017190Z self = 
2025-05-07T20:32:50.3018373Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:50.3019820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c8451c0>}
2025-05-07T20:32:50.3021230Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.3022292Z context = 
2025-05-07T20:32:50.3022590Z 
2025-05-07T20:32:50.3022766Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.3023302Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.3023783Z             module_map=module_map)
2025-05-07T20:32:50.3024157Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.3024510Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:50.3024774Z E   ^
2025-05-07T20:32:50.3025251Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.3025713Z 
2025-05-07T20:32:50.3026141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
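Every failing example above dies in the same place for the same reason: Triton cannot lower the fp8e4nv (float8 e4m3) element type for this GPU, and only fp8e4b15 and fp8e5 are available. Triton emits this particular ValueError on CUDA devices below compute capability 8.9, so the practical fix is to gate e4m3 kernels and their tests on device capability. A minimal sketch of such a guard follows, assuming an 8.9 threshold; supports_fp8e4nv is a hypothetical helper name, not an FBGEMM or Triton API:

# Sketch: skip fp8e4nv (e4m3) tests on GPUs below compute capability 8.9,
# the threshold implied by Triton's "type fp8e4nv not supported" error.
# supports_fp8e4nv is a made-up name for illustration only.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns (major, minor), e.g. (8, 6) or (9, 0).
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv (e4m3) requires SM 8.9+")
class SiluMulQuantFp8Test(unittest.TestCase):
    ...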
2025-05-07T20:32:50.3026687Z 
2025-05-07T20:32:50.3026790Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant (test source and traceback identical to the example above)
2025-05-07T20:32:50.4208209Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:50.4240260Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:50.5999784Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:50.6033199Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
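The Hypothesis sweep over T, D, scale_ub, contiguous, and compiled never changes the outcome, because the failure happens at kernel compile time, before any shape-dependent work runs. Any Triton kernel that casts to tl.float8e4nv reproduces it. A standalone repro sketch follows; the kernel and names below are illustrative, not FBGEMM code:

# Sketch: the cast to tl.float8e4nv alone triggers the ValueError at
# compile time on a pre-SM-8.9 device; on newer GPUs this runs fine.
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This .to() is the operation the architecture check rejects.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)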
2025-05-07T20:32:50.7396537Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:50.7397013Z     self=,
2025-05-07T20:32:50.7397433Z     T=1,
2025-05-07T20:32:50.7397629Z     D=7168,
2025-05-07T20:32:50.7397830Z     scale_ub=None,
2025-05-07T20:32:50.7398059Z     contiguous=False,
2025-05-07T20:32:50.7398303Z     compiled=True,
2025-05-07T20:32:50.7405512Z )
2025-05-07T20:32:50.8268548Z self = 
2025-05-07T20:32:50.8269312Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:50.8269618Z 
2025-05-07T20:32:50.8269703Z     @given(
2025-05-07T20:32:50.8269945Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:50.8270265Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:50.8270572Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:50.8270921Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:50.8271255Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:50.8271535Z     )
2025-05-07T20:32:50.8271898Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:50.8272358Z     def test_silu_mul_quant(
2025-05-07T20:32:50.8272804Z         self,
2025-05-07T20:32:50.8273005Z         T: int,
2025-05-07T20:32:50.8273207Z         D: int,
2025-05-07T20:32:50.8273434Z         scale_ub: Optional[float],
2025-05-07T20:32:50.8273706Z         contiguous: bool,
2025-05-07T20:32:50.8274040Z         compiled: bool,
2025-05-07T20:32:50.8274277Z     ) -> None:
2025-05-07T20:32:50.8274493Z         torch.manual_seed(2025)
2025-05-07T20:32:50.8274747Z 
2025-05-07T20:32:50.8275033Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:50.8275384Z 
2025-05-07T20:32:50.8275589Z         x_sign = torch.sign(x)
2025-05-07T20:32:50.8275968Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:50.8276281Z         x = x_sign * x_clamp
2025-05-07T20:32:50.8276526Z         x0 = x[:, :D]
2025-05-07T20:32:50.8276752Z         x1 = x[:, D:]
2025-05-07T20:32:50.8276957Z 
2025-05-07T20:32:50.8277148Z         if contiguous:
2025-05-07T20:32:50.8277385Z             x0 = x0.contiguous()
2025-05-07T20:32:50.8277650Z             x1 = x1.contiguous()
2025-05-07T20:32:50.8277890Z 
2025-05-07T20:32:50.8278084Z         if scale_ub is not None:
2025-05-07T20:32:50.8278361Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:50.8278703Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:50.8279025Z             )
2025-05-07T20:32:50.8279225Z         else:
2025-05-07T20:32:50.8279436Z             scale_ub_tensor = None
2025-05-07T20:32:50.8279695Z 
2025-05-07T20:32:50.8279932Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:50.8280255Z             op = silu_mul_quant
2025-05-07T20:32:50.8280512Z             if compiled:
2025-05-07T20:32:50.8280767Z                 op = torch.compile(op)
2025-05-07T20:32:50.8281065Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:50.8281355Z 
2025-05-07T20:32:50.8281553Z         y_fp8, y_scale = fn()
2025-05-07T20:32:50.8281842Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:50.8282151Z 
2025-05-07T20:32:50.8282392Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:50.8282737Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:50.8283034Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:50.8283359Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:50.8283731Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.8284042Z 
2025-05-07T20:32:50.8284250Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:50.8284450Z 
2025-05-07T20:32:50.8284563Z moe/activation_test.py:126: 
2025-05-07T20:32:50.8284866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:50.8285215Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:50.8285654Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.8286484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:50.8287260Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:50.8287830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.8288542Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.8289254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:50.8290004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:50.8290765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:50.8291434Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:50.8292141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:50.8292740Z     fn()
2025-05-07T20:32:50.8293280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:50.8293886Z     self.fn.run(
2025-05-07T20:32:50.8294412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.8294969Z     kernel = self.compile(
2025-05-07T20:32:50.8295530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.8296207Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.8296675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:50.8296920Z 
2025-05-07T20:32:50.8297142Z self = 
2025-05-07T20:32:50.8298277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:50.8299725Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92cb80>}
2025-05-07T20:32:50.8301268Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.8302351Z context = 
2025-05-07T20:32:50.8302652Z 
2025-05-07T20:32:50.8302830Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.8303379Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.8303862Z             module_map=module_map)
2025-05-07T20:32:50.8304243Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.8304619Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:50.8304890Z E   ^
2025-05-07T20:32:50.8305373Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.8305846Z 
2025-05-07T20:32:50.8306594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
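This example is the one variant in the section that gets past fn(): the failure moves to the reference path, where triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which performs the same cast to the e4m3 type. As a rough eager-mode stand-in for what that reference computes, assuming per-row max-abs scaling into the e4m3 range (an assumption about the semantics, not FBGEMM's actual implementation):

# Sketch: eager-mode row-wise fp8 quantization in the spirit of
# triton_quantize_fp8_row. The semantics (max-abs per row, optional
# scale_ub clamp) are assumed here for illustration.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # per-row scale
    y_fp8 = (y.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return y_fp8.to(torch.float8_e4m3fn), scale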
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8305846Z 2025-05-07T20:32:50.8306594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8307136Z 2025-05-07T20:32:50.8307249Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8307685Z self=, 2025-05-07T20:32:50.8308109Z T=1, 2025-05-07T20:32:50.8308299Z D=5120, 2025-05-07T20:32:50.8308590Z scale_ub=1200.0, 2025-05-07T20:32:50.8308827Z contiguous=False, 2025-05-07T20:32:50.8309065Z compiled=True, 2025-05-07T20:32:50.8309273Z ) 2025-05-07T20:32:50.9859160Z self = 2025-05-07T20:32:50.9859966Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.9860354Z 2025-05-07T20:32:50.9860463Z @given( 2025-05-07T20:32:50.9860790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9861112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9861432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9861772Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9862101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9862402Z ) 2025-05-07T20:32:50.9862765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9863222Z def test_silu_mul_quant( 2025-05-07T20:32:50.9863482Z self, 2025-05-07T20:32:50.9863686Z T: int, 2025-05-07T20:32:50.9863886Z D: int, 2025-05-07T20:32:50.9864111Z scale_ub: Optional[float], 2025-05-07T20:32:50.9864575Z contiguous: bool, 2025-05-07T20:32:50.9864826Z compiled: bool, 2025-05-07T20:32:50.9865066Z ) -> None: 2025-05-07T20:32:50.9865292Z torch.manual_seed(2025) 2025-05-07T20:32:50.9865651Z 2025-05-07T20:32:50.9865925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9866280Z 2025-05-07T20:32:50.9866479Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9866774Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9867093Z x = x_sign * x_clamp 2025-05-07T20:32:50.9867421Z x0 = x[:, :D] 2025-05-07T20:32:50.9867666Z x1 = x[:, D:] 2025-05-07T20:32:50.9867907Z 2025-05-07T20:32:50.9868098Z if contiguous: 2025-05-07T20:32:50.9868334Z x0 = x0.contiguous() 2025-05-07T20:32:50.9868606Z x1 = x1.contiguous() 2025-05-07T20:32:50.9868854Z 2025-05-07T20:32:50.9869045Z if scale_ub is not None: 2025-05-07T20:32:50.9869326Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9869672Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9869985Z ) 2025-05-07T20:32:50.9870186Z else: 2025-05-07T20:32:50.9870410Z scale_ub_tensor = None 2025-05-07T20:32:50.9870664Z 2025-05-07T20:32:50.9870903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9871228Z op = silu_mul_quant 2025-05-07T20:32:50.9871489Z if compiled: 2025-05-07T20:32:50.9871740Z op = torch.compile(op) 2025-05-07T20:32:50.9872051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9872334Z 2025-05-07T20:32:50.9872528Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9872709Z 2025-05-07T20:32:50.9872814Z moe/activation_test.py:117: 2025-05-07T20:32:50.9873123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9873464Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9873758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9874343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.9874933Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.9875613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9876331Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9876888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9877595Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9878372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9878929Z kernel = self.compile( 2025-05-07T20:32:50.9879498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9880172Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9880589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9880831Z 2025-05-07T20:32:50.9881053Z self = 2025-05-07T20:32:50.9882183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9883630Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92de40>} 2025-05-07T20:32:50.9885080Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9886149Z context = 2025-05-07T20:32:50.9886490Z 2025-05-07T20:32:50.9886669Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9887207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9887690Z module_map=module_map) 2025-05-07T20:32:50.9888072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9888488Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9888754Z E ^ 2025-05-07T20:32:50.9889239Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9889708Z 2025-05-07T20:32:50.9890149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9890683Z 2025-05-07T20:32:50.9890795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9891222Z self=, 2025-05-07T20:32:50.9891644Z T=1, 2025-05-07T20:32:50.9891919Z D=5120, 2025-05-07T20:32:50.9892116Z scale_ub=1200.0, 2025-05-07T20:32:50.9892364Z contiguous=False, 2025-05-07T20:32:50.9892594Z compiled=False, 2025-05-07T20:32:50.9892811Z ) 2025-05-07T20:32:50.9893147Z self = 2025-05-07T20:32:50.9893667Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.9893946Z 2025-05-07T20:32:50.9894028Z @given( 2025-05-07T20:32:50.9894274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9894601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9894916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9895267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9895610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9895899Z ) 2025-05-07T20:32:50.9896264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9896729Z def test_silu_mul_quant( 2025-05-07T20:32:50.9896987Z self, 2025-05-07T20:32:50.9897189Z T: int, 2025-05-07T20:32:50.9897398Z D: int, 2025-05-07T20:32:50.9897626Z scale_ub: Optional[float], 2025-05-07T20:32:50.9897902Z contiguous: bool, 2025-05-07T20:32:50.9898151Z compiled: bool, 2025-05-07T20:32:50.9898381Z ) -> None: 2025-05-07T20:32:50.9898597Z torch.manual_seed(2025) 2025-05-07T20:32:50.9898850Z 2025-05-07T20:32:50.9899185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9899533Z 2025-05-07T20:32:50.9899736Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9900035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9900353Z x = x_sign * x_clamp 2025-05-07T20:32:50.9900603Z x0 = x[:, :D] 2025-05-07T20:32:50.9900830Z x1 = x[:, D:] 2025-05-07T20:32:50.9901043Z 2025-05-07T20:32:50.9901235Z if contiguous: 2025-05-07T20:32:50.9901478Z x0 = x0.contiguous() 2025-05-07T20:32:50.9901739Z x1 = x1.contiguous() 2025-05-07T20:32:50.9901986Z 2025-05-07T20:32:50.9902192Z if scale_ub is not None: 2025-05-07T20:32:50.9902471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9902813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9903130Z ) 2025-05-07T20:32:50.9903330Z else: 2025-05-07T20:32:50.9903541Z scale_ub_tensor = None 2025-05-07T20:32:50.9903809Z 2025-05-07T20:32:50.9904047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9904370Z op = silu_mul_quant 2025-05-07T20:32:50.9904681Z if compiled: 2025-05-07T20:32:50.9904941Z op = torch.compile(op) 2025-05-07T20:32:50.9905244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9905533Z 2025-05-07T20:32:50.9905777Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9905949Z 2025-05-07T20:32:50.9906052Z moe/activation_test.py:117: 2025-05-07T20:32:50.9906627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9906975Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9907270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9908076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9908809Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9909377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9910096Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9910799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9911365Z kernel = self.compile( 2025-05-07T20:32:50.9911937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9912621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9913047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9913290Z 2025-05-07T20:32:50.9913514Z self = 2025-05-07T20:32:50.9914645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9916075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92eac0>} 2025-05-07T20:32:50.9917480Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9918550Z context = 2025-05-07T20:32:50.9918849Z 2025-05-07T20:32:50.9919029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9919569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9920128Z module_map=module_map) 2025-05-07T20:32:50.9920507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9920878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9921141Z E ^ 2025-05-07T20:32:50.9921624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9922096Z 2025-05-07T20:32:50.9922535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9923067Z 2025-05-07T20:32:50.9923179Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9923604Z self=, 2025-05-07T20:32:50.9924026Z T=16384, 2025-05-07T20:32:50.9924228Z D=5120, 2025-05-07T20:32:50.9924422Z scale_ub=1200.0, 2025-05-07T20:32:50.9924659Z contiguous=False, 2025-05-07T20:32:50.9924895Z compiled=True, 2025-05-07T20:32:50.9925102Z ) 2025-05-07T20:32:51.0795840Z self = 2025-05-07T20:32:51.0796889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.0797287Z 2025-05-07T20:32:51.0797396Z @given( 2025-05-07T20:32:51.0797699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.0798072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.0798467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.0798796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.0799129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.0799418Z ) 2025-05-07T20:32:51.0799769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.0800304Z def test_silu_mul_quant( 2025-05-07T20:32:51.0800555Z self, 2025-05-07T20:32:51.0800756Z T: int, 2025-05-07T20:32:51.0800951Z D: int, 2025-05-07T20:32:51.0801181Z scale_ub: Optional[float], 2025-05-07T20:32:51.0801460Z contiguous: bool, 2025-05-07T20:32:51.0801696Z compiled: bool, 2025-05-07T20:32:51.0801930Z ) -> None: 2025-05-07T20:32:51.0802150Z torch.manual_seed(2025) 2025-05-07T20:32:51.0802391Z 2025-05-07T20:32:51.0802668Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.0803027Z 2025-05-07T20:32:51.0803218Z x_sign = torch.sign(x) 2025-05-07T20:32:51.0803517Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.0803835Z x = x_sign * x_clamp 2025-05-07T20:32:51.0804075Z x0 = x[:, :D] 2025-05-07T20:32:51.0804296Z x1 = x[:, D:] 2025-05-07T20:32:51.0804511Z 2025-05-07T20:32:51.0804696Z if contiguous: 2025-05-07T20:32:51.0804932Z x0 = x0.contiguous() 2025-05-07T20:32:51.0805194Z x1 = x1.contiguous() 2025-05-07T20:32:51.0805432Z 2025-05-07T20:32:51.0805628Z if scale_ub is not None: 2025-05-07T20:32:51.0805905Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.0806522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.0806834Z ) 2025-05-07T20:32:51.0807034Z else: 2025-05-07T20:32:51.0807246Z scale_ub_tensor = None 2025-05-07T20:32:51.0807494Z 2025-05-07T20:32:51.0807731Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.0808054Z op = silu_mul_quant 2025-05-07T20:32:51.0808302Z if compiled: 2025-05-07T20:32:51.0808553Z op = torch.compile(op) 2025-05-07T20:32:51.0808857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0809161Z 2025-05-07T20:32:51.0809353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.0809526Z 2025-05-07T20:32:51.0809626Z moe/activation_test.py:117: 2025-05-07T20:32:51.0809930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0810361Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.0810649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0811226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.0811883Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.0812560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.0813283Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.0813840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.0814547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.0815231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.0815789Z kernel = self.compile( 2025-05-07T20:32:51.0816354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.0817132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.0817551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0817795Z 2025-05-07T20:32:51.0818010Z self = 2025-05-07T20:32:51.0819190Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.0820688Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c668180>} 2025-05-07T20:32:51.0822091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.0823161Z context = 2025-05-07T20:32:51.0823458Z 2025-05-07T20:32:51.0823634Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.0824179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.0824655Z module_map=module_map) 2025-05-07T20:32:51.0825030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.0825395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.0825657Z E ^ 2025-05-07T20:32:51.0826138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.0826604Z 2025-05-07T20:32:51.0827042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.0827579Z 2025-05-07T20:32:51.0827710Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.0828160Z self=, 2025-05-07T20:32:51.0828576Z T=2048, 2025-05-07T20:32:51.0828768Z D=7168, 2025-05-07T20:32:51.0828958Z scale_ub=1200.0, 2025-05-07T20:32:51.0829189Z contiguous=False, 2025-05-07T20:32:51.0829419Z compiled=True, 2025-05-07T20:32:51.0829623Z ) 2025-05-07T20:32:51.0829952Z self = 2025-05-07T20:32:51.0830463Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.0830748Z 2025-05-07T20:32:51.0830833Z @given( 2025-05-07T20:32:51.0831062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.0831385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.0831749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.0832081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.0832417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.0832707Z ) 2025-05-07T20:32:51.0833060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.0833514Z def test_silu_mul_quant( 2025-05-07T20:32:51.0833762Z self, 2025-05-07T20:32:51.0833956Z T: int, 2025-05-07T20:32:51.0834156Z D: int, 2025-05-07T20:32:51.0834376Z scale_ub: Optional[float], 2025-05-07T20:32:51.0834649Z contiguous: bool, 2025-05-07T20:32:51.0834893Z compiled: bool, 2025-05-07T20:32:51.0835123Z ) -> None: 2025-05-07T20:32:51.0835345Z torch.manual_seed(2025) 2025-05-07T20:32:51.0835586Z 2025-05-07T20:32:51.0835867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.0836219Z 2025-05-07T20:32:51.0836430Z x_sign = torch.sign(x) 2025-05-07T20:32:51.0844115Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.0844443Z x = x_sign * x_clamp 2025-05-07T20:32:51.0844779Z x0 = x[:, :D] 2025-05-07T20:32:51.0845011Z x1 = x[:, D:] 2025-05-07T20:32:51.0845226Z 2025-05-07T20:32:51.0845411Z if contiguous: 2025-05-07T20:32:51.0845652Z x0 = x0.contiguous() 2025-05-07T20:32:51.0845985Z x1 = x1.contiguous() 2025-05-07T20:32:51.0846324Z 2025-05-07T20:32:51.0846574Z if scale_ub is not None: 2025-05-07T20:32:51.0846855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.0847198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.0847584Z ) 2025-05-07T20:32:51.0847800Z else: 2025-05-07T20:32:51.0848045Z scale_ub_tensor = None 2025-05-07T20:32:51.0848305Z 2025-05-07T20:32:51.0848544Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.0848866Z op = silu_mul_quant 2025-05-07T20:32:51.0849130Z if compiled: 2025-05-07T20:32:51.0849381Z op = torch.compile(op) 2025-05-07T20:32:51.0849677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0849959Z 2025-05-07T20:32:51.0850154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.0850322Z 2025-05-07T20:32:51.0850428Z moe/activation_test.py:117: 2025-05-07T20:32:51.0850731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0851073Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.0851362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0852008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.0852590Z return fn(*args, **kwargs) 
[... identical CompilationError traceback elided: type fp8e4nv not supported in this architecture; supported fp8 dtypes are ('fp8e4b15', 'fp8e5') ...]
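To reproduce one failing example outside Hypothesis, a standalone sketch; the import path mirrors the traceback above and should be treated as an assumption about the installed package layout. On a pre-SM-8.9 GPU this is expected to raise the same CompilationError:

    # Standalone repro sketch for the example above (T=2048, D=7168,
    # scale_ub=1200.0, contiguous=False, i.e. non-contiguous views).
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 2048, 7168
    torch.manual_seed(2025)
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]  # slices of x: non-contiguous, matching contiguous=False
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)  # CompilationError on SM < 8.9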
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.0867269Z 2025-05-07T20:32:51.0867698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.0868236Z 2025-05-07T20:32:51.2017638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2018552Z self=, 2025-05-07T20:32:51.2018987Z T=1, 2025-05-07T20:32:51.2019179Z D=5120, 2025-05-07T20:32:51.2019365Z scale_ub=None, 2025-05-07T20:32:51.2019584Z contiguous=False, 2025-05-07T20:32:51.2019821Z compiled=False, 2025-05-07T20:32:51.2020024Z ) 2025-05-07T20:32:51.2020349Z self = 2025-05-07T20:32:51.2020859Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.2021123Z 2025-05-07T20:32:51.2021205Z @given( 2025-05-07T20:32:51.2021431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2021754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2022066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2022394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2022723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2023011Z ) 2025-05-07T20:32:51.2023362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2023812Z def test_silu_mul_quant( 2025-05-07T20:32:51.2024055Z self, 2025-05-07T20:32:51.2024250Z T: int, 2025-05-07T20:32:51.2024450Z D: int, 2025-05-07T20:32:51.2024671Z scale_ub: Optional[float], 2025-05-07T20:32:51.2024947Z contiguous: bool, 2025-05-07T20:32:51.2025191Z compiled: bool, 2025-05-07T20:32:51.2025422Z ) -> None: 2025-05-07T20:32:51.2025639Z torch.manual_seed(2025) 2025-05-07T20:32:51.2025881Z 2025-05-07T20:32:51.2026159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2026508Z 2025-05-07T20:32:51.2026701Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2026994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2027313Z x = x_sign * x_clamp 2025-05-07T20:32:51.2027548Z x0 = x[:, :D] 2025-05-07T20:32:51.2027771Z x1 = x[:, D:] 2025-05-07T20:32:51.2027980Z 2025-05-07T20:32:51.2028167Z if contiguous: 2025-05-07T20:32:51.2028402Z x0 = x0.contiguous() 2025-05-07T20:32:51.2028664Z x1 = x1.contiguous() 2025-05-07T20:32:51.2028998Z 2025-05-07T20:32:51.2029194Z if scale_ub is not None: 2025-05-07T20:32:51.2029470Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2029805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2030121Z ) 2025-05-07T20:32:51.2030318Z else: 2025-05-07T20:32:51.2030532Z scale_ub_tensor = None 2025-05-07T20:32:51.2030788Z 2025-05-07T20:32:51.2031030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2031354Z op = silu_mul_quant 2025-05-07T20:32:51.2031605Z if compiled: 2025-05-07T20:32:51.2031857Z op = torch.compile(op) 2025-05-07T20:32:51.2032159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2032434Z 2025-05-07T20:32:51.2032633Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.2032800Z 2025-05-07T20:32:51.2032906Z moe/activation_test.py:117: 2025-05-07T20:32:51.2033206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2033546Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.2033835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2034630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.2035339Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.2035960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2036663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2037336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2037922Z kernel = self.compile( 2025-05-07T20:32:51.2038480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2039157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2039561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2039800Z 2025-05-07T20:32:51.2040013Z self = 2025-05-07T20:32:51.2041126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2042561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c669e40>} 2025-05-07T20:32:51.2043950Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2045012Z context = 2025-05-07T20:32:51.2045313Z 2025-05-07T20:32:51.2045482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2046020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2046494Z module_map=module_map) 2025-05-07T20:32:51.2046864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2047224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.2047488Z E ^ 2025-05-07T20:32:51.2047958Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2048427Z 2025-05-07T20:32:51.2048853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.2049382Z 2025-05-07T20:32:51.2049593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2050023Z self=, 2025-05-07T20:32:51.2050431Z T=4096, 2025-05-07T20:32:51.2050625Z D=7168, 2025-05-07T20:32:51.2050824Z scale_ub=1200.0, 2025-05-07T20:32:51.2051047Z contiguous=False, 2025-05-07T20:32:51.2051281Z compiled=False, 2025-05-07T20:32:51.2051494Z ) 2025-05-07T20:32:51.2051890Z self = 2025-05-07T20:32:51.2052405Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.2052689Z 2025-05-07T20:32:51.2052774Z @given( 2025-05-07T20:32:51.2053002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2053327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2053642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2053979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2054314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2054608Z ) 2025-05-07T20:32:51.2055014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2055466Z def test_silu_mul_quant( 2025-05-07T20:32:51.2055714Z self, 2025-05-07T20:32:51.2055913Z T: int, 2025-05-07T20:32:51.2056110Z D: int, 2025-05-07T20:32:51.2056380Z scale_ub: Optional[float], 2025-05-07T20:32:51.2056657Z contiguous: bool, 2025-05-07T20:32:51.2056900Z compiled: bool, 2025-05-07T20:32:51.2057127Z ) -> None: 2025-05-07T20:32:51.2057348Z torch.manual_seed(2025) 2025-05-07T20:32:51.2057589Z 2025-05-07T20:32:51.2057862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2058254Z 2025-05-07T20:32:51.2058444Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2058739Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2059054Z x = x_sign * x_clamp 2025-05-07T20:32:51.2059297Z x0 = x[:, :D] 2025-05-07T20:32:51.2059511Z x1 = x[:, D:] 2025-05-07T20:32:51.2059720Z 2025-05-07T20:32:51.2059912Z if contiguous: 2025-05-07T20:32:51.2060138Z x0 = x0.contiguous() 2025-05-07T20:32:51.2060398Z x1 = x1.contiguous() 2025-05-07T20:32:51.2060639Z 2025-05-07T20:32:51.2060829Z if scale_ub is not None: 2025-05-07T20:32:51.2061103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2061443Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2061748Z ) 2025-05-07T20:32:51.2061946Z else: 2025-05-07T20:32:51.2062160Z scale_ub_tensor = None 2025-05-07T20:32:51.2062412Z 2025-05-07T20:32:51.2062648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2062965Z op = silu_mul_quant 2025-05-07T20:32:51.2063212Z if compiled: 2025-05-07T20:32:51.2063467Z op = torch.compile(op) 2025-05-07T20:32:51.2063783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2064055Z 2025-05-07T20:32:51.2064257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.2064425Z 2025-05-07T20:32:51.2064532Z moe/activation_test.py:117: 2025-05-07T20:32:51.2064838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2065174Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.2065464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2066171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.2066873Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.2067428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2068127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2068860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2069403Z kernel = self.compile( 2025-05-07T20:32:51.2069958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2070637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2071041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2071285Z 2025-05-07T20:32:51.2071498Z self = 2025-05-07T20:32:51.2072613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2074040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c66b380>} 2025-05-07T20:32:51.2075475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2076528Z context = 2025-05-07T20:32:51.2076867Z 2025-05-07T20:32:51.2077036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2077571Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2078079Z module_map=module_map) 2025-05-07T20:32:51.2078513Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2078872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.2079140Z E ^ 2025-05-07T20:32:51.2079616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2080085Z 2025-05-07T20:32:51.2080514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.2081050Z 2025-05-07T20:32:51.2081154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2081579Z self=, 2025-05-07T20:32:51.2081988Z T=16384, 2025-05-07T20:32:51.2082190Z D=7168, 2025-05-07T20:32:51.2082390Z scale_ub=None, 2025-05-07T20:32:51.2082602Z contiguous=True, 2025-05-07T20:32:51.2082830Z compiled=True, 2025-05-07T20:32:51.2083039Z ) 2025-05-07T20:32:51.3831440Z self = 2025-05-07T20:32:51.3832254Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3832549Z 2025-05-07T20:32:51.3832654Z @given( 2025-05-07T20:32:51.3832890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3833207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3833517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3833851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3834185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3834477Z ) 2025-05-07T20:32:51.3834827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3835285Z def test_silu_mul_quant( 2025-05-07T20:32:51.3835531Z self, 2025-05-07T20:32:51.3835725Z T: int, 2025-05-07T20:32:51.3835926Z D: int, 2025-05-07T20:32:51.3836153Z scale_ub: Optional[float], 2025-05-07T20:32:51.3836420Z contiguous: bool, 2025-05-07T20:32:51.3836663Z compiled: bool, 2025-05-07T20:32:51.3836897Z ) -> None: 2025-05-07T20:32:51.3837427Z torch.manual_seed(2025) 2025-05-07T20:32:51.3837702Z 2025-05-07T20:32:51.3837979Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3838323Z 2025-05-07T20:32:51.3838531Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3838828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3839146Z x = x_sign * x_clamp 2025-05-07T20:32:51.3839387Z x0 = x[:, :D] 2025-05-07T20:32:51.3839614Z x1 = x[:, D:] 2025-05-07T20:32:51.3839826Z 2025-05-07T20:32:51.3840011Z if contiguous: 2025-05-07T20:32:51.3840244Z x0 = x0.contiguous() 2025-05-07T20:32:51.3840504Z x1 = x1.contiguous() 2025-05-07T20:32:51.3840741Z 2025-05-07T20:32:51.3840931Z if scale_ub is not None: 2025-05-07T20:32:51.3841209Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3841542Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3841855Z ) 2025-05-07T20:32:51.3842053Z else: 2025-05-07T20:32:51.3842258Z scale_ub_tensor = None 2025-05-07T20:32:51.3842511Z 2025-05-07T20:32:51.3842829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3843145Z op = silu_mul_quant 2025-05-07T20:32:51.3843397Z if compiled: 2025-05-07T20:32:51.3843648Z op = torch.compile(op) 2025-05-07T20:32:51.3844024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3844299Z 2025-05-07T20:32:51.3844490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3844657Z 2025-05-07T20:32:51.3844759Z moe/activation_test.py:117: 2025-05-07T20:32:51.3845054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3845474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3845761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3846334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3846917Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3847600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3848357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3848902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3849608Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3850298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3850841Z kernel = self.compile( 2025-05-07T20:32:51.3851407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3852178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3852591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3852827Z 2025-05-07T20:32:51.3853044Z self = 2025-05-07T20:32:51.3854160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3855604Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e04a0>} 2025-05-07T20:32:51.3856998Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3858063Z context = 2025-05-07T20:32:51.3858407Z 2025-05-07T20:32:51.3858577Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3859115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3859596Z module_map=module_map) 2025-05-07T20:32:51.3859961Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3860329Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3860594Z E ^ 2025-05-07T20:32:51.3861071Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3861537Z 2025-05-07T20:32:51.3861966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3862510Z 2025-05-07T20:32:51.3862614Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3863046Z self=, 2025-05-07T20:32:51.3863461Z T=4096, 2025-05-07T20:32:51.3863646Z D=5120, 2025-05-07T20:32:51.3863840Z scale_ub=None, 2025-05-07T20:32:51.3864105Z contiguous=False, 2025-05-07T20:32:51.3864328Z compiled=True, 2025-05-07T20:32:51.3864535Z ) 2025-05-07T20:32:51.3864861Z self = 2025-05-07T20:32:51.3865407Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3865693Z 2025-05-07T20:32:51.3865773Z @given( 2025-05-07T20:32:51.3866008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3866322Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3866632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3867034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3867364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3867655Z ) 2025-05-07T20:32:51.3868017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3868475Z def test_silu_mul_quant( 2025-05-07T20:32:51.3868715Z self, 2025-05-07T20:32:51.3868917Z T: int, 2025-05-07T20:32:51.3869120Z D: int, 2025-05-07T20:32:51.3869337Z scale_ub: Optional[float], 2025-05-07T20:32:51.3869618Z contiguous: bool, 2025-05-07T20:32:51.3869870Z compiled: bool, 2025-05-07T20:32:51.3870092Z ) -> None: 2025-05-07T20:32:51.3870311Z torch.manual_seed(2025) 2025-05-07T20:32:51.3870558Z 2025-05-07T20:32:51.3870828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3871182Z 2025-05-07T20:32:51.3871381Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3871679Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3871998Z x = x_sign * x_clamp 2025-05-07T20:32:51.3872246Z x0 = x[:, :D] 2025-05-07T20:32:51.3872462Z x1 = x[:, D:] 2025-05-07T20:32:51.3872677Z 2025-05-07T20:32:51.3872869Z if contiguous: 2025-05-07T20:32:51.3873109Z x0 = x0.contiguous() 2025-05-07T20:32:51.3873369Z x1 = x1.contiguous() 2025-05-07T20:32:51.3873617Z 2025-05-07T20:32:51.3873811Z if scale_ub is not None: 2025-05-07T20:32:51.3874081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3874423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3874743Z ) 2025-05-07T20:32:51.3874936Z else: 2025-05-07T20:32:51.3875151Z scale_ub_tensor = None 2025-05-07T20:32:51.3875407Z 2025-05-07T20:32:51.3875636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3875960Z op = silu_mul_quant 2025-05-07T20:32:51.3876215Z if compiled: 2025-05-07T20:32:51.3876457Z op = torch.compile(op) 2025-05-07T20:32:51.3876760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3877092Z 2025-05-07T20:32:51.3877289Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3877490Z 2025-05-07T20:32:51.3877606Z moe/activation_test.py:117: 2025-05-07T20:32:51.3877907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3878244Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3878522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3879113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3879693Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3880372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3881083Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3881635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3882349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3883081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3883631Z kernel = self.compile( 2025-05-07T20:32:51.3884190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3884917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3885319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3885556Z 2025-05-07T20:32:51.3885769Z self = 2025-05-07T20:32:51.3886925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3888353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e11c0>} 2025-05-07T20:32:51.3889742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3890813Z context = 2025-05-07T20:32:51.3891116Z 2025-05-07T20:32:51.3891286Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3891877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3892371Z module_map=module_map) 2025-05-07T20:32:51.3892744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3893109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3893379Z E ^ 2025-05-07T20:32:51.3901777Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3902286Z 2025-05-07T20:32:51.3902722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3903252Z 2025-05-07T20:32:51.6981019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6981581Z self=, 2025-05-07T20:32:51.6982089Z T=4096, 2025-05-07T20:32:51.6982281Z D=5120, 2025-05-07T20:32:51.6982476Z scale_ub=1200.0, 2025-05-07T20:32:51.6982700Z contiguous=False, 2025-05-07T20:32:51.6982950Z compiled=False, 2025-05-07T20:32:51.6983165Z ) 2025-05-07T20:32:51.6983489Z self = 2025-05-07T20:32:51.6984295Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.6984594Z 2025-05-07T20:32:51.6984676Z @given( 2025-05-07T20:32:51.6984914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6985241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6985556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6985897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6986236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6986532Z ) 2025-05-07T20:32:51.6986901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6987353Z def test_silu_mul_quant( 2025-05-07T20:32:51.6987607Z self, 2025-05-07T20:32:51.6987812Z T: int, 2025-05-07T20:32:51.6988010Z D: int, 2025-05-07T20:32:51.6988237Z scale_ub: Optional[float], 2025-05-07T20:32:51.6988515Z contiguous: bool, 2025-05-07T20:32:51.6988759Z compiled: bool, 2025-05-07T20:32:51.6989013Z ) -> None: 2025-05-07T20:32:51.6989238Z torch.manual_seed(2025) 2025-05-07T20:32:51.6989494Z 2025-05-07T20:32:51.6989863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6990227Z 2025-05-07T20:32:51.6990436Z x_sign = torch.sign(x) 2025-05-07T20:32:51.6990730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.6991133Z x = x_sign * x_clamp 2025-05-07T20:32:51.6991389Z x0 = x[:, :D] 2025-05-07T20:32:51.6991612Z x1 = x[:, D:] 2025-05-07T20:32:51.6991832Z 2025-05-07T20:32:51.6992036Z if contiguous: 2025-05-07T20:32:51.6992273Z x0 = x0.contiguous() 2025-05-07T20:32:51.6992545Z x1 = x1.contiguous() 2025-05-07T20:32:51.6992875Z 2025-05-07T20:32:51.6993072Z if scale_ub is not None: 2025-05-07T20:32:51.6993357Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.6993705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.6994022Z ) 2025-05-07T20:32:51.6994228Z else: 2025-05-07T20:32:51.6994449Z scale_ub_tensor = None 2025-05-07T20:32:51.6994708Z 2025-05-07T20:32:51.6994945Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.6995272Z op = silu_mul_quant 2025-05-07T20:32:51.6995531Z if compiled: 2025-05-07T20:32:51.6995784Z op = torch.compile(op) 2025-05-07T20:32:51.6996092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.6996376Z 2025-05-07T20:32:51.6996572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.6996746Z 2025-05-07T20:32:51.6996850Z moe/activation_test.py:117: 2025-05-07T20:32:51.6997156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.6997498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.6997792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.6998515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.6999235Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.6999789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7000496Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7001187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7001735Z kernel = self.compile( 2025-05-07T20:32:51.7002297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7002980Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7003395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7003633Z 2025-05-07T20:32:51.7003903Z self = 2025-05-07T20:32:51.7005032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7006747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e2160>} 2025-05-07T20:32:51.7008160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7009233Z context = 2025-05-07T20:32:51.7009532Z 2025-05-07T20:32:51.7009704Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7010254Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7010830Z module_map=module_map) 2025-05-07T20:32:51.7011214Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7011586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7011937Z E ^ 2025-05-07T20:32:51.7012492Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7012961Z 2025-05-07T20:32:51.7013396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7013934Z 2025-05-07T20:32:51.7014102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7014536Z self=, 2025-05-07T20:32:51.7014955Z T=4096, 2025-05-07T20:32:51.7015146Z D=5120, 2025-05-07T20:32:51.7015353Z scale_ub=1200.0, 2025-05-07T20:32:51.7015590Z contiguous=False, 2025-05-07T20:32:51.7015820Z compiled=True, 2025-05-07T20:32:51.7016036Z ) 2025-05-07T20:32:51.7016370Z self = 2025-05-07T20:32:51.7016881Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.7017173Z 2025-05-07T20:32:51.7017256Z @given( 2025-05-07T20:32:51.7017496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7017814Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7018130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7018469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7018812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7019101Z ) 2025-05-07T20:32:51.7019464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7019922Z def test_silu_mul_quant( 2025-05-07T20:32:51.7020168Z self, 2025-05-07T20:32:51.7020373Z T: int, 2025-05-07T20:32:51.7020578Z D: int, 2025-05-07T20:32:51.7020804Z scale_ub: Optional[float], 2025-05-07T20:32:51.7021090Z contiguous: bool, 2025-05-07T20:32:51.7021340Z compiled: bool, 2025-05-07T20:32:51.7021567Z ) -> None: 2025-05-07T20:32:51.7021793Z torch.manual_seed(2025) 2025-05-07T20:32:51.7022050Z 2025-05-07T20:32:51.7022325Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7022679Z 2025-05-07T20:32:51.7022879Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7023175Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7023499Z x = x_sign * x_clamp 2025-05-07T20:32:51.7023749Z x0 = x[:, :D] 2025-05-07T20:32:51.7023975Z x1 = x[:, D:] 2025-05-07T20:32:51.7024184Z 2025-05-07T20:32:51.7024377Z if contiguous: 2025-05-07T20:32:51.7024688Z x0 = x0.contiguous() 2025-05-07T20:32:51.7024950Z x1 = x1.contiguous() 2025-05-07T20:32:51.7025198Z 2025-05-07T20:32:51.7025402Z if scale_ub is not None: 2025-05-07T20:32:51.7025679Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7026025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7026347Z ) 2025-05-07T20:32:51.7026549Z else: 2025-05-07T20:32:51.7026770Z scale_ub_tensor = None 2025-05-07T20:32:51.7027030Z 2025-05-07T20:32:51.7027270Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7027630Z op = silu_mul_quant 2025-05-07T20:32:51.7027907Z if compiled: 2025-05-07T20:32:51.7028159Z op = torch.compile(op) 2025-05-07T20:32:51.7028469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7028758Z 2025-05-07T20:32:51.7028955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7029132Z 2025-05-07T20:32:51.7029238Z moe/activation_test.py:117: 2025-05-07T20:32:51.7029547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7030005Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7030296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7030879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.7031527Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.7032203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.7032917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7033473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7034223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7034908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7035459Z kernel = self.compile( 2025-05-07T20:32:51.7036027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7036709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7037115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7037361Z 2025-05-07T20:32:51.7037576Z self = 2025-05-07T20:32:51.7038748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7040181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c0e3240>} 2025-05-07T20:32:51.7041576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7042637Z context = 2025-05-07T20:32:51.7042946Z 2025-05-07T20:32:51.7043121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7043661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7044140Z module_map=module_map) 2025-05-07T20:32:51.7044520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7044884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7045150Z E ^ 2025-05-07T20:32:51.7045682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7046157Z 2025-05-07T20:32:51.7046591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7047134Z 2025-05-07T20:32:51.8192597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8193789Z self=, 2025-05-07T20:32:51.8194684Z T=2048, 2025-05-07T20:32:51.8195061Z D=7168, 2025-05-07T20:32:51.8195461Z scale_ub=1200.0, 2025-05-07T20:32:51.8195918Z contiguous=False, 2025-05-07T20:32:51.8196377Z compiled=False, 2025-05-07T20:32:51.8196783Z ) 2025-05-07T20:32:51.8197435Z self = 2025-05-07T20:32:51.8197983Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.8198292Z 2025-05-07T20:32:51.8198375Z @given( 2025-05-07T20:32:51.8198624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8198943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8199509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8199858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8200193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8200491Z ) 2025-05-07T20:32:51.8200928Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8201383Z def test_silu_mul_quant( 2025-05-07T20:32:51.8201639Z self, 2025-05-07T20:32:51.8201847Z T: int, 2025-05-07T20:32:51.8202055Z D: int, 2025-05-07T20:32:51.8202273Z scale_ub: Optional[float], 2025-05-07T20:32:51.8202636Z contiguous: bool, 2025-05-07T20:32:51.8202883Z compiled: bool, 2025-05-07T20:32:51.8203109Z ) -> None: 2025-05-07T20:32:51.8203332Z torch.manual_seed(2025) 2025-05-07T20:32:51.8203581Z 2025-05-07T20:32:51.8203866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8204218Z 2025-05-07T20:32:51.8204415Z x_sign = torch.sign(x) 2025-05-07T20:32:51.8204710Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.8205030Z x = x_sign * x_clamp 2025-05-07T20:32:51.8205282Z x0 = x[:, :D] 2025-05-07T20:32:51.8205504Z x1 = x[:, D:] 2025-05-07T20:32:51.8205719Z 2025-05-07T20:32:51.8205913Z if contiguous: 2025-05-07T20:32:51.8206390Z x0 = x0.contiguous() 2025-05-07T20:32:51.8206665Z x1 = x1.contiguous() 2025-05-07T20:32:51.8206914Z 2025-05-07T20:32:51.8207105Z if scale_ub is not None: 2025-05-07T20:32:51.8207390Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.8207739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.8208063Z ) 2025-05-07T20:32:51.8208291Z else: 2025-05-07T20:32:51.8208530Z scale_ub_tensor = None 2025-05-07T20:32:51.8208792Z 2025-05-07T20:32:51.8209024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.8209357Z op = silu_mul_quant 2025-05-07T20:32:51.8209615Z if compiled: 2025-05-07T20:32:51.8209865Z op = torch.compile(op) 2025-05-07T20:32:51.8210175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8210462Z 2025-05-07T20:32:51.8210661Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.8210836Z 2025-05-07T20:32:51.8210941Z moe/activation_test.py:117: 2025-05-07T20:32:51.8211248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8211587Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.8211949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8212669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.8213480Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.8214035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.8214748Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.8215442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.8216000Z kernel = self.compile( 2025-05-07T20:32:51.8216556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.8217237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.8217653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8217893Z 2025-05-07T20:32:51.8218135Z self = 2025-05-07T20:32:51.8219354Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.8220792Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a4220>} 2025-05-07T20:32:51.8222251Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.8223323Z context = 2025-05-07T20:32:51.8223678Z 2025-05-07T20:32:51.8223849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.8224393Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.8224882Z module_map=module_map) 2025-05-07T20:32:51.8225261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.8225623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.8225890Z E ^ 2025-05-07T20:32:51.8226372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.8226841Z 2025-05-07T20:32:51.8227272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.8227811Z 2025-05-07T20:32:51.8227921Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8228350Z self=, 2025-05-07T20:32:51.8228772Z T=1, 2025-05-07T20:32:51.8228960Z D=7168, 2025-05-07T20:32:51.8229162Z scale_ub=None, 2025-05-07T20:32:51.8229385Z contiguous=True, 2025-05-07T20:32:51.8229613Z compiled=False, 2025-05-07T20:32:51.8229828Z ) 2025-05-07T20:32:51.8230156Z self = 2025-05-07T20:32:51.8230657Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.8230932Z 2025-05-07T20:32:51.8231013Z @given( 2025-05-07T20:32:51.8231252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8231577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8231892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8232232Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8232574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8232863Z ) 2025-05-07T20:32:51.8233228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8233685Z def test_silu_mul_quant( 2025-05-07T20:32:51.8233932Z self, 2025-05-07T20:32:51.8234136Z T: int, 2025-05-07T20:32:51.8234393Z D: int, 2025-05-07T20:32:51.8234616Z scale_ub: Optional[float], 2025-05-07T20:32:51.8234894Z contiguous: bool, 2025-05-07T20:32:51.8235144Z compiled: bool, 2025-05-07T20:32:51.8235375Z ) -> None: 2025-05-07T20:32:51.8235600Z torch.manual_seed(2025) 2025-05-07T20:32:51.8235852Z 2025-05-07T20:32:51.8236133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8236492Z 2025-05-07T20:32:51.8236697Z x_sign = torch.sign(x) 2025-05-07T20:32:51.8236993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.8237314Z x = x_sign * x_clamp 2025-05-07T20:32:51.8237568Z x0 = x[:, :D] 2025-05-07T20:32:51.8237793Z x1 = x[:, D:] 2025-05-07T20:32:51.8238008Z 2025-05-07T20:32:51.8238201Z if contiguous: 2025-05-07T20:32:51.8238443Z x0 = x0.contiguous() 2025-05-07T20:32:51.8238704Z x1 = x1.contiguous() 2025-05-07T20:32:51.8238956Z 2025-05-07T20:32:51.8239158Z if scale_ub is not None: 2025-05-07T20:32:51.8239436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.8239831Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.8240149Z ) 2025-05-07T20:32:51.8240343Z else: 2025-05-07T20:32:51.8240560Z scale_ub_tensor = None 2025-05-07T20:32:51.8240819Z 2025-05-07T20:32:51.8241096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.8241419Z op = silu_mul_quant 2025-05-07T20:32:51.8241678Z if compiled: 2025-05-07T20:32:51.8241926Z op = torch.compile(op) 2025-05-07T20:32:51.8242230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8242557Z 2025-05-07T20:32:51.8242759Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.8242928Z 2025-05-07T20:32:51.8243030Z moe/activation_test.py:117: 2025-05-07T20:32:51.8243341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8243686Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.8243970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8244687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.8245402Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.8245962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.8246666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.8247356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.8247912Z kernel = self.compile( 2025-05-07T20:32:51.8248474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.8249176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.8249594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8249836Z 2025-05-07T20:32:51.8250062Z self = 2025-05-07T20:32:51.8251201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.8252724Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a5120>} 2025-05-07T20:32:51.8254157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.8255294Z context = 2025-05-07T20:32:51.8255600Z 2025-05-07T20:32:51.8255786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.8256332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.8256825Z module_map=module_map) 2025-05-07T20:32:51.8257208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.8257599Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.8257895Z E ^ 2025-05-07T20:32:51.8258385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.8258867Z 2025-05-07T20:32:51.8259319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.8259864Z 2025-05-07T20:32:51.8259976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8260422Z self=, 2025-05-07T20:32:51.8260849Z T=16384, 2025-05-07T20:32:51.8261124Z D=7168, 2025-05-07T20:32:51.8261331Z scale_ub=1200.0, 2025-05-07T20:32:51.8261568Z contiguous=False, 2025-05-07T20:32:51.8261798Z compiled=True, 2025-05-07T20:32:52.0663208Z ) 2025-05-07T20:32:52.0664100Z self = 2025-05-07T20:32:52.0664804Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.0665097Z 2025-05-07T20:32:52.0665179Z @given( 2025-05-07T20:32:52.0665422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.0665854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.0666166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.0666495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.0666839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.0667132Z ) 2025-05-07T20:32:52.0667485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.0667949Z def test_silu_mul_quant( 2025-05-07T20:32:52.0668249Z self, 2025-05-07T20:32:52.0668450Z T: int, 2025-05-07T20:32:52.0668658Z D: int, 2025-05-07T20:32:52.0668891Z scale_ub: Optional[float], 2025-05-07T20:32:52.0669172Z contiguous: bool, 2025-05-07T20:32:52.0669426Z compiled: bool, 2025-05-07T20:32:52.0669665Z ) -> None: 2025-05-07T20:32:52.0676937Z torch.manual_seed(2025) 2025-05-07T20:32:52.0677225Z 2025-05-07T20:32:52.0677512Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.0677871Z 2025-05-07T20:32:52.0678070Z x_sign = torch.sign(x) 2025-05-07T20:32:52.0678419Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.0678739Z x = x_sign * x_clamp 2025-05-07T20:32:52.0678982Z x0 = x[:, :D] 2025-05-07T20:32:52.0679208Z x1 = x[:, D:] 2025-05-07T20:32:52.0679425Z 2025-05-07T20:32:52.0679614Z if contiguous: 2025-05-07T20:32:52.0679864Z x0 = x0.contiguous() 2025-05-07T20:32:52.0680132Z x1 = x1.contiguous() 2025-05-07T20:32:52.0680375Z 2025-05-07T20:32:52.0680563Z if scale_ub is not None: 2025-05-07T20:32:52.0680834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.0681185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.0681502Z ) 2025-05-07T20:32:52.0681703Z else: 2025-05-07T20:32:52.0681922Z scale_ub_tensor = None 2025-05-07T20:32:52.0682174Z 2025-05-07T20:32:52.0682418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.0682745Z op = silu_mul_quant 2025-05-07T20:32:52.0682996Z if compiled: 2025-05-07T20:32:52.0683249Z op = torch.compile(op) 2025-05-07T20:32:52.0683686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0683967Z 2025-05-07T20:32:52.0684166Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.0684342Z 2025-05-07T20:32:52.0684445Z moe/activation_test.py:117: 2025-05-07T20:32:52.0684753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0685088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.0685381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0685961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.0686535Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.0687215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.0687976Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.0688532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.0689234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.0690005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.0690564Z kernel = self.compile( 2025-05-07T20:32:52.0691120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.0691917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.0692333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0692572Z 2025-05-07T20:32:52.0692792Z self = 2025-05-07T20:32:52.0693958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.0695396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a6520>} 2025-05-07T20:32:52.0696792Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.0697862Z context = 2025-05-07T20:32:52.0698160Z 2025-05-07T20:32:52.0698338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.0698874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.0699361Z module_map=module_map) 2025-05-07T20:32:52.0699741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.0700101Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.0700372Z E ^ 2025-05-07T20:32:52.0700856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.0701326Z 2025-05-07T20:32:52.0701763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.0702298Z 2025-05-07T20:32:52.0702409Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.0702841Z self=, 2025-05-07T20:32:52.0703260Z T=1, 2025-05-07T20:32:52.0703448Z D=7168, 2025-05-07T20:32:52.0703656Z scale_ub=None, 2025-05-07T20:32:52.0703882Z contiguous=False, 2025-05-07T20:32:52.0704114Z compiled=False, 2025-05-07T20:32:52.0704334Z ) 2025-05-07T20:32:52.0704715Z self = 2025-05-07T20:32:52.0705228Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.0705501Z 2025-05-07T20:32:52.0705583Z @given( 2025-05-07T20:32:52.0705824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.0706607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.0706927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.0707273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.0707618Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.0707909Z ) 2025-05-07T20:32:52.0708268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.0708723Z def test_silu_mul_quant( 2025-05-07T20:32:52.0708977Z self, 2025-05-07T20:32:52.0709173Z T: int, 2025-05-07T20:32:52.0709378Z D: int, 2025-05-07T20:32:52.0709602Z scale_ub: Optional[float], 2025-05-07T20:32:52.0709879Z contiguous: bool, 2025-05-07T20:32:52.0710130Z compiled: bool, 2025-05-07T20:32:52.0710364Z ) -> None: 2025-05-07T20:32:52.0710582Z torch.manual_seed(2025) 2025-05-07T20:32:52.0710911Z 2025-05-07T20:32:52.0711193Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.0711541Z 2025-05-07T20:32:52.0711748Z x_sign = torch.sign(x) 2025-05-07T20:32:52.0712111Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.0712432Z x = x_sign * x_clamp 2025-05-07T20:32:52.0712683Z x0 = x[:, :D] 2025-05-07T20:32:52.0712909Z x1 = x[:, D:] 2025-05-07T20:32:52.0713116Z 2025-05-07T20:32:52.0713309Z if contiguous: 2025-05-07T20:32:52.0713550Z x0 = x0.contiguous() 2025-05-07T20:32:52.0713899Z x1 = x1.contiguous() 2025-05-07T20:32:52.0714145Z 2025-05-07T20:32:52.0714344Z if scale_ub is not None: 2025-05-07T20:32:52.0714625Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.0714969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.0715289Z ) 2025-05-07T20:32:52.0715490Z else: 2025-05-07T20:32:52.0715709Z scale_ub_tensor = None 2025-05-07T20:32:52.0715972Z 2025-05-07T20:32:52.0716216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.0716539Z op = silu_mul_quant 2025-05-07T20:32:52.0716803Z if compiled: 2025-05-07T20:32:52.0717060Z op = torch.compile(op) 2025-05-07T20:32:52.0717362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0717648Z 2025-05-07T20:32:52.0717851Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.0718020Z 2025-05-07T20:32:52.0718123Z moe/activation_test.py:117: 2025-05-07T20:32:52.0718437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0718785Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.0719079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.0719790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.0720510Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.0721069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.0721776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.0722471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.0723025Z kernel = self.compile( 2025-05-07T20:32:52.0723584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.0724261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.0724759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.0725000Z 2025-05-07T20:32:52.0725218Z self = 2025-05-07T20:32:52.0726348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.0727771Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c1a7100>} 2025-05-07T20:32:52.0729169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.0730238Z context = 2025-05-07T20:32:52.0730539Z 2025-05-07T20:32:52.0730719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.0731298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.0731785Z module_map=module_map) 2025-05-07T20:32:52.0732262Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.0732628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.0732934Z E ^ 2025-05-07T20:32:52.0733415Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.0733883Z 2025-05-07T20:32:52.0734320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.0734893Z 2025-05-07T20:32:52.0735005Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.0735427Z self=, 2025-05-07T20:32:52.0735850Z T=2048, 2025-05-07T20:32:52.0736043Z D=7168, 2025-05-07T20:32:52.0736238Z scale_ub=None, 2025-05-07T20:32:52.0736462Z contiguous=False, 2025-05-07T20:32:52.0736701Z compiled=True, 2025-05-07T20:32:52.0736907Z ) 2025-05-07T20:32:52.1601255Z self = 2025-05-07T20:32:52.1602036Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.1602447Z 2025-05-07T20:32:52.1602531Z @given( 2025-05-07T20:32:52.1602765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1603077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1603397Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1603741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1604080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1604366Z ) 2025-05-07T20:32:52.1604732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1605194Z def test_silu_mul_quant( 2025-05-07T20:32:52.1605441Z self, 2025-05-07T20:32:52.1605645Z T: int, 2025-05-07T20:32:52.1605858Z D: int, 2025-05-07T20:32:52.1606078Z scale_ub: Optional[float], 2025-05-07T20:32:52.1606611Z contiguous: bool, 2025-05-07T20:32:52.1606863Z compiled: bool, 2025-05-07T20:32:52.1607100Z ) -> None: 2025-05-07T20:32:52.1607327Z torch.manual_seed(2025) 2025-05-07T20:32:52.1607605Z 2025-05-07T20:32:52.1607903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1608262Z 2025-05-07T20:32:52.1608462Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1608759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1609078Z x = x_sign * x_clamp 2025-05-07T20:32:52.1609324Z x0 = x[:, :D] 2025-05-07T20:32:52.1609546Z x1 = x[:, D:] 2025-05-07T20:32:52.1609752Z 2025-05-07T20:32:52.1610238Z if contiguous: 2025-05-07T20:32:52.1610480Z x0 = x0.contiguous() 2025-05-07T20:32:52.1610742Z x1 = x1.contiguous() 2025-05-07T20:32:52.1610993Z 2025-05-07T20:32:52.1611190Z if scale_ub is not None: 2025-05-07T20:32:52.1611464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1611883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1612207Z ) 2025-05-07T20:32:52.1612404Z else: 2025-05-07T20:32:52.1612621Z scale_ub_tensor = None 2025-05-07T20:32:52.1612882Z 2025-05-07T20:32:52.1613117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1613441Z op = silu_mul_quant 2025-05-07T20:32:52.1613701Z if compiled: 2025-05-07T20:32:52.1613949Z op = torch.compile(op) 2025-05-07T20:32:52.1614265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1614550Z 2025-05-07T20:32:52.1614748Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1614923Z 2025-05-07T20:32:52.1615026Z moe/activation_test.py:117: 2025-05-07T20:32:52.1615422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1615769Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1616055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1616637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1617288Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.1617971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1618689Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1619318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1620034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1620718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1621274Z kernel = self.compile( 2025-05-07T20:32:52.1621837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1622512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1622929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1623176Z 2025-05-07T20:32:52.1623391Z self = 2025-05-07T20:32:52.1624516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1625964Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d0720>} 2025-05-07T20:32:52.1627353Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1628469Z context = 2025-05-07T20:32:52.1628774Z 2025-05-07T20:32:52.1628945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1629485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1629965Z module_map=module_map) 2025-05-07T20:32:52.1630338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1630707Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1630969Z E ^ 2025-05-07T20:32:52.1631493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1631967Z 2025-05-07T20:32:52.1632401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1632933Z 2025-05-07T20:32:52.1633045Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1633472Z self=, 2025-05-07T20:32:52.1633892Z T=4096, 2025-05-07T20:32:52.1634089Z D=7168, 2025-05-07T20:32:52.1634286Z scale_ub=None, 2025-05-07T20:32:52.1634507Z contiguous=False, 2025-05-07T20:32:52.1634741Z compiled=True, 2025-05-07T20:32:52.1634951Z ) 2025-05-07T20:32:52.1635280Z self = 2025-05-07T20:32:52.1635797Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.1636080Z 2025-05-07T20:32:52.1636174Z @given( 2025-05-07T20:32:52.1636408Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1636783Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1637105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1637437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1637781Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1638128Z ) 2025-05-07T20:32:52.1638532Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1638995Z def test_silu_mul_quant( 2025-05-07T20:32:52.1639250Z self, 2025-05-07T20:32:52.1639455Z T: int, 2025-05-07T20:32:52.1639659Z D: int, 2025-05-07T20:32:52.1639929Z scale_ub: Optional[float], 2025-05-07T20:32:52.1640213Z contiguous: bool, 2025-05-07T20:32:52.1640455Z compiled: bool, 2025-05-07T20:32:52.1640692Z ) -> None: 2025-05-07T20:32:52.1640919Z torch.manual_seed(2025) 2025-05-07T20:32:52.1641164Z 2025-05-07T20:32:52.1641444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1641799Z 2025-05-07T20:32:52.1641995Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1642300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1642623Z x = x_sign * x_clamp 2025-05-07T20:32:52.1642873Z x0 = x[:, :D] 2025-05-07T20:32:52.1643099Z x1 = x[:, D:] 2025-05-07T20:32:52.1643312Z 2025-05-07T20:32:52.1643499Z if contiguous: 2025-05-07T20:32:52.1643738Z x0 = x0.contiguous() 2025-05-07T20:32:52.1644010Z x1 = x1.contiguous() 2025-05-07T20:32:52.1644251Z 2025-05-07T20:32:52.1644455Z if scale_ub is not None: 2025-05-07T20:32:52.1644734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1645076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1645394Z ) 2025-05-07T20:32:52.1645599Z else: 2025-05-07T20:32:52.1645821Z scale_ub_tensor = None 2025-05-07T20:32:52.1646072Z 2025-05-07T20:32:52.1646314Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1646639Z op = silu_mul_quant 2025-05-07T20:32:52.1646891Z if compiled: 2025-05-07T20:32:52.1647148Z op = torch.compile(op) 2025-05-07T20:32:52.1647457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1647734Z 2025-05-07T20:32:52.1647933Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1648103Z 2025-05-07T20:32:52.1648209Z moe/activation_test.py:117: 2025-05-07T20:32:52.1648506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1648852Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1649142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1649769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1650343Z return fn(*args, **kwargs) 
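Every example fails at the same point: Triton rejects the fp8e4nv (FP8 E4M3) element type while compiling _fbgemm_silu_mul_quant for this GPU. The job runs on a g5.4xlarge, whose NVIDIA A10G reports compute capability 8.6, and Triton's fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper); on sm_86 only fp8e4b15 and fp8e5 are accepted, exactly as the ValueError says. A minimal guard sketch, assuming only the public torch.cuda API (the helper name supports_fp8e4nv is hypothetical):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) Triton kernels generally need sm_89+; the A10G on this
    # runner is sm_86, which is why compilation raises the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# A test could then opt out cleanly instead of failing at compile time:
# @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 requires sm_89+")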
2025-05-07T20:32:52.1651026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1651736Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1652350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1653060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1653748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1654302Z kernel = self.compile( 2025-05-07T20:32:52.1654863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1655549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1655970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1656209Z 2025-05-07T20:32:52.1656484Z self = 2025-05-07T20:32:52.1657602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1659065Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d1440>} 2025-05-07T20:32:52.1660462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1661580Z context = 2025-05-07T20:32:52.1661878Z 2025-05-07T20:32:52.1662051Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1662596Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1663083Z module_map=module_map) 2025-05-07T20:32:52.1663460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1663824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1664097Z E ^ 2025-05-07T20:32:52.1664580Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1665047Z 2025-05-07T20:32:52.1665487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1666022Z 2025-05-07T20:32:52.3253995Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3254689Z self=, 2025-05-07T20:32:52.3255261Z T=16384, 2025-05-07T20:32:52.3255560Z D=5120, 2025-05-07T20:32:52.3255755Z scale_ub=1200.0, 2025-05-07T20:32:52.3255983Z contiguous=False, 2025-05-07T20:32:52.3256207Z compiled=False, 2025-05-07T20:32:52.3256421Z ) 2025-05-07T20:32:52.3256747Z self = 2025-05-07T20:32:52.3257282Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3257573Z 2025-05-07T20:32:52.3257659Z @given( 2025-05-07T20:32:52.3257890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3258219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3258541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3258881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3259217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3259513Z ) 2025-05-07T20:32:52.3260165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3260623Z def test_silu_mul_quant( 2025-05-07T20:32:52.3260878Z self, 2025-05-07T20:32:52.3261082Z T: int, 2025-05-07T20:32:52.3261282Z D: int, 2025-05-07T20:32:52.3261512Z scale_ub: Optional[float], 2025-05-07T20:32:52.3261792Z contiguous: bool, 2025-05-07T20:32:52.3262038Z compiled: bool, 2025-05-07T20:32:52.3262275Z ) -> None: 2025-05-07T20:32:52.3262499Z torch.manual_seed(2025) 2025-05-07T20:32:52.3262745Z 2025-05-07T20:32:52.3263029Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3263386Z 2025-05-07T20:32:52.3263592Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3263887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3264209Z x = x_sign * x_clamp 2025-05-07T20:32:52.3264459Z x0 = x[:, :D] 2025-05-07T20:32:52.3264686Z x1 = x[:, D:] 2025-05-07T20:32:52.3264903Z 2025-05-07T20:32:52.3265102Z if contiguous: 2025-05-07T20:32:52.3265338Z x0 = x0.contiguous() 2025-05-07T20:32:52.3265706Z x1 = x1.contiguous() 2025-05-07T20:32:52.3265963Z 2025-05-07T20:32:52.3266157Z if scale_ub is not None: 2025-05-07T20:32:52.3266439Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3266864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3267177Z ) 2025-05-07T20:32:52.3267379Z else: 2025-05-07T20:32:52.3267598Z scale_ub_tensor = None 2025-05-07T20:32:52.3267854Z 2025-05-07T20:32:52.3268092Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3268520Z op = silu_mul_quant 2025-05-07T20:32:52.3268780Z if compiled: 2025-05-07T20:32:52.3269029Z op = torch.compile(op) 2025-05-07T20:32:52.3269337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3269622Z 2025-05-07T20:32:52.3269816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3269992Z 2025-05-07T20:32:52.3270094Z moe/activation_test.py:117: 2025-05-07T20:32:52.3270403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3270745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3271034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3271755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3272470Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3273019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3273731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3274424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3274975Z kernel = self.compile( 2025-05-07T20:32:52.3275538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3276219Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3276633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3276874Z 2025-05-07T20:32:52.3277087Z self = 2025-05-07T20:32:52.3278222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3279750Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d2340>} 2025-05-07T20:32:52.3281182Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3282266Z context = 2025-05-07T20:32:52.3282574Z 2025-05-07T20:32:52.3282750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3283300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3290647Z module_map=module_map) 2025-05-07T20:32:52.3291251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3291630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3291941Z E ^ 2025-05-07T20:32:52.3292429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3292895Z 2025-05-07T20:32:52.3293412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3293958Z 2025-05-07T20:32:52.3294065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3294495Z self=, 2025-05-07T20:32:52.3294963Z T=16384, 2025-05-07T20:32:52.3295159Z D=5120, 2025-05-07T20:32:52.3295359Z scale_ub=1200.0, 2025-05-07T20:32:52.3295597Z contiguous=True, 2025-05-07T20:32:52.3295820Z compiled=True, 2025-05-07T20:32:52.3296031Z ) 2025-05-07T20:32:52.3296367Z self = 2025-05-07T20:32:52.3296928Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.3297228Z 2025-05-07T20:32:52.3297307Z @given( 2025-05-07T20:32:52.3297548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3297867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3298183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3298528Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3298866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3299153Z ) 2025-05-07T20:32:52.3299512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3299972Z def test_silu_mul_quant( 2025-05-07T20:32:52.3300216Z self, 2025-05-07T20:32:52.3300415Z T: int, 2025-05-07T20:32:52.3300616Z D: int, 2025-05-07T20:32:52.3300835Z scale_ub: Optional[float], 2025-05-07T20:32:52.3301113Z contiguous: bool, 2025-05-07T20:32:52.3301362Z compiled: bool, 2025-05-07T20:32:52.3301588Z ) -> None: 2025-05-07T20:32:52.3301816Z torch.manual_seed(2025) 2025-05-07T20:32:52.3302067Z 2025-05-07T20:32:52.3302345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3302695Z 2025-05-07T20:32:52.3302892Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3303190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3303508Z x = x_sign * x_clamp 2025-05-07T20:32:52.3303756Z x0 = x[:, :D] 2025-05-07T20:32:52.3303972Z x1 = x[:, D:] 2025-05-07T20:32:52.3304190Z 2025-05-07T20:32:52.3304383Z if contiguous: 2025-05-07T20:32:52.3304619Z x0 = x0.contiguous() 2025-05-07T20:32:52.3304877Z x1 = x1.contiguous() 2025-05-07T20:32:52.3305120Z 2025-05-07T20:32:52.3305316Z if scale_ub is not None: 2025-05-07T20:32:52.3305596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3305946Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3306579Z ) 2025-05-07T20:32:52.3306774Z else: 2025-05-07T20:32:52.3306993Z scale_ub_tensor = None 2025-05-07T20:32:52.3307340Z 2025-05-07T20:32:52.3307575Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3307896Z op = silu_mul_quant 2025-05-07T20:32:52.3308177Z if compiled: 2025-05-07T20:32:52.3308523Z op = torch.compile(op) 2025-05-07T20:32:52.3308868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3309151Z 2025-05-07T20:32:52.3309347Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3309522Z 2025-05-07T20:32:52.3309624Z moe/activation_test.py:117: 2025-05-07T20:32:52.3309934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3310279Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3310564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3311146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3311731Z return fn(*args, **kwargs) 
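The decorators in the listing above are Hypothesis property-based testing: each st.sampled_from strategy draws one value per trial, so every "Trying example:" record in this log is a single draw of (T, D, scale_ub, contiguous, compiled), capped by max_examples. A self-contained sketch of the same pattern, assuming nothing about the FBGEMM test module (_MAX_SAMPLES is defined elsewhere in it; 16 is a stand-in here):

from hypothesis import Verbosity, given, settings, strategies as st

@given(
    t=st.sampled_from([1, 128, 2048]),
    d=st.sampled_from([5120, 7168]),
)
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def check_shapes(t: int, d: int) -> None:
    # Verbose mode prints one "Trying example:" line per draw, as in the log.
    assert t > 0 and d > 0

check_shapes()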
2025-05-07T20:32:52.3312408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3313125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3313766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3314473Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3315227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3315780Z kernel = self.compile( 2025-05-07T20:32:52.3316343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3317085Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3317494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3317738Z 2025-05-07T20:32:52.3317956Z self = 2025-05-07T20:32:52.3319078Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3320503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d39c0>} 2025-05-07T20:32:52.3321892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3322958Z context = 2025-05-07T20:32:52.3323262Z 2025-05-07T20:32:52.3323433Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3323978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3324460Z module_map=module_map) 2025-05-07T20:32:52.3324838Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3325204Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3325467Z E ^ 2025-05-07T20:32:52.3325951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3326422Z 2025-05-07T20:32:52.3326852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3327388Z 2025-05-07T20:32:52.5025540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5026123Z self=, 2025-05-07T20:32:52.5026546Z T=16384, 2025-05-07T20:32:52.5026991Z D=5120, 2025-05-07T20:32:52.5027219Z scale_ub=None, 2025-05-07T20:32:52.5027441Z contiguous=False, 2025-05-07T20:32:52.5027675Z compiled=True, 2025-05-07T20:32:52.5027937Z ) 2025-05-07T20:32:52.5028287Z self = 2025-05-07T20:32:52.5028807Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.5029108Z 2025-05-07T20:32:52.5029192Z @given( 2025-05-07T20:32:52.5029434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5029759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5030073Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5030416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5030763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5031057Z ) 2025-05-07T20:32:52.5031425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5031891Z def test_silu_mul_quant( 2025-05-07T20:32:52.5032140Z self, 2025-05-07T20:32:52.5032345Z T: int, 2025-05-07T20:32:52.5032552Z D: int, 2025-05-07T20:32:52.5032859Z scale_ub: Optional[float], 2025-05-07T20:32:52.5033145Z contiguous: bool, 2025-05-07T20:32:52.5033399Z compiled: bool, 2025-05-07T20:32:52.5033631Z ) -> None: 2025-05-07T20:32:52.5033930Z torch.manual_seed(2025) 2025-05-07T20:32:52.5034184Z 2025-05-07T20:32:52.5034462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5034819Z 2025-05-07T20:32:52.5035024Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5035322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5035713Z x = x_sign * x_clamp 2025-05-07T20:32:52.5035962Z x0 = x[:, :D] 2025-05-07T20:32:52.5036187Z x1 = x[:, D:] 2025-05-07T20:32:52.5036397Z 2025-05-07T20:32:52.5036594Z if contiguous: 2025-05-07T20:32:52.5036840Z x0 = x0.contiguous() 2025-05-07T20:32:52.5037103Z x1 = x1.contiguous() 2025-05-07T20:32:52.5037352Z 2025-05-07T20:32:52.5037553Z if scale_ub is not None: 2025-05-07T20:32:52.5037829Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5038223Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5038546Z ) 2025-05-07T20:32:52.5038745Z else: 2025-05-07T20:32:52.5038966Z scale_ub_tensor = None 2025-05-07T20:32:52.5039226Z 2025-05-07T20:32:52.5039462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5039791Z op = silu_mul_quant 2025-05-07T20:32:52.5040052Z if compiled: 2025-05-07T20:32:52.5040313Z op = torch.compile(op) 2025-05-07T20:32:52.5040611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5040898Z 2025-05-07T20:32:52.5041105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.5041274Z 2025-05-07T20:32:52.5041379Z moe/activation_test.py:117: 2025-05-07T20:32:52.5041688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5042039Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.5042329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5042919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.5043510Z return fn(*args, **kwargs) 
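For reference, a plain-PyTorch sketch of what the test body expects silu_mul_quant to produce, inferred from the test alone (the kernel source is not in this log, so the rowwise max-abs scaling, the FP8_MAX constant, and the returned scale shape are assumptions):

import torch

FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # silu(x0) * x1 in higher precision, then quantize each row to FP8 E4M3.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        # scale_ub arrives as a 1-element float32 tensor, as in the test.
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)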
2025-05-07T20:32:52.5044196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.5044909Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.5045468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5046190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5046933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5047496Z kernel = self.compile( 2025-05-07T20:32:52.5048069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5048756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5049172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5049420Z 2025-05-07T20:32:52.5049640Z self = 2025-05-07T20:32:52.5050772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5052303Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce40c20>} 2025-05-07T20:32:52.5053753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5054842Z context = 2025-05-07T20:32:52.5055222Z 2025-05-07T20:32:52.5055396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.5055952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.5056438Z module_map=module_map) 2025-05-07T20:32:52.5056860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.5057232Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.5057503Z E ^ 2025-05-07T20:32:52.5057994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5058478Z 2025-05-07T20:32:52.5058922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.5059467Z 2025-05-07T20:32:52.5059580Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5060024Z self=, 2025-05-07T20:32:52.5060447Z T=2048, 2025-05-07T20:32:52.5060643Z D=5120, 2025-05-07T20:32:52.5060842Z scale_ub=None, 2025-05-07T20:32:52.5061062Z contiguous=False, 2025-05-07T20:32:52.5061297Z compiled=True, 2025-05-07T20:32:52.5061519Z ) 2025-05-07T20:32:52.5967326Z self = 2025-05-07T20:32:52.5968132Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.5968737Z 2025-05-07T20:32:52.5968899Z @given( 2025-05-07T20:32:52.5969386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5970035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5970671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5971354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5972132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5972718Z ) 2025-05-07T20:32:52.5973446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5974367Z def test_silu_mul_quant( 2025-05-07T20:32:52.5974856Z self, 2025-05-07T20:32:52.5975259Z T: int, 2025-05-07T20:32:52.5975664Z D: int, 2025-05-07T20:32:52.5976106Z scale_ub: Optional[float], 2025-05-07T20:32:52.5976658Z contiguous: bool, 2025-05-07T20:32:52.5977151Z compiled: bool, 2025-05-07T20:32:52.5977611Z ) -> None: 2025-05-07T20:32:52.5978046Z torch.manual_seed(2025) 2025-05-07T20:32:52.5978665Z 2025-05-07T20:32:52.5978953Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5979304Z 2025-05-07T20:32:52.5979505Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5979811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5980128Z x = x_sign * x_clamp 2025-05-07T20:32:52.5980376Z x0 = x[:, :D] 2025-05-07T20:32:52.5980606Z x1 = x[:, D:] 2025-05-07T20:32:52.5980816Z 2025-05-07T20:32:52.5981008Z if contiguous: 2025-05-07T20:32:52.5981247Z x0 = x0.contiguous() 2025-05-07T20:32:52.5981510Z x1 = x1.contiguous() 2025-05-07T20:32:52.5981758Z 2025-05-07T20:32:52.5981956Z if scale_ub is not None: 2025-05-07T20:32:52.5982234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5982581Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5982905Z ) 2025-05-07T20:32:52.5983104Z else: 2025-05-07T20:32:52.5983322Z scale_ub_tensor = None 2025-05-07T20:32:52.5983587Z 2025-05-07T20:32:52.5983826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5984235Z op = silu_mul_quant 2025-05-07T20:32:52.5984505Z if compiled: 2025-05-07T20:32:52.5984765Z op = torch.compile(op) 2025-05-07T20:32:52.5985067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5985424Z 2025-05-07T20:32:52.5985626Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.5985797Z 2025-05-07T20:32:52.5985900Z moe/activation_test.py:117: 2025-05-07T20:32:52.5986209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5986554Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.5986915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5987497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.5988143Z return fn(*args, **kwargs) 
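The jit.py frames that follow show why the failure surfaces mid-test rather than at import: indexing a @triton.jit function with a grid returns a launcher, and the kernel is only compiled on its first call (jit.py run -> self.compile -> make_ir), so an unsupported dtype raises at launch time. A toy sketch of that dispatch, assuming only the public Triton API (requires a CUDA device):

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

src = torch.arange(1024, device="cuda", dtype=torch.float32)
dst = torch.empty_like(src)
grid = (triton.cdiv(src.numel(), 256),)
_copy_kernel[grid](src, dst, src.numel(), BLOCK=256)  # compiled here, on first launch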
2025-05-07T20:32:52.5988839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.5989559Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.5990124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5990846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5991548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5992104Z kernel = self.compile( 2025-05-07T20:32:52.5992672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5993366Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5993785Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5994030Z 2025-05-07T20:32:52.5994247Z self = 2025-05-07T20:32:52.5995393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5996867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce419e0>} 2025-05-07T20:32:52.5998298Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5999428Z context = 2025-05-07T20:32:52.5999738Z 2025-05-07T20:32:52.5999961Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6000516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6001009Z module_map=module_map) 2025-05-07T20:32:52.6001381Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6001750Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6002024Z E ^ 2025-05-07T20:32:52.6002505Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[... identical test listing and CompilationError traceback elided for the following examples ...]
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0559355Z 2025-05-07T20:32:53.0559788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.0560334Z 2025-05-07T20:32:53.1786240Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1786700Z self=, 2025-05-07T20:32:53.1787161Z T=16384, 2025-05-07T20:32:53.1787446Z D=5120, 2025-05-07T20:32:53.1787689Z scale_ub=1200.0, 2025-05-07T20:32:53.1787916Z contiguous=True, 2025-05-07T20:32:53.1788143Z compiled=False, 2025-05-07T20:32:53.1788384Z ) 2025-05-07T20:32:53.1788735Z self = 2025-05-07T20:32:53.1789265Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.1789556Z 2025-05-07T20:32:53.1789643Z @given( 2025-05-07T20:32:53.1789878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1790209Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1790529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1790872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1791467Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1791768Z ) 2025-05-07T20:32:53.1792131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1792581Z def test_silu_mul_quant( 2025-05-07T20:32:53.1792832Z self, 2025-05-07T20:32:53.1793037Z T: int, 2025-05-07T20:32:53.1793237Z D: int, 2025-05-07T20:32:53.1793465Z scale_ub: Optional[float], 2025-05-07T20:32:53.1793744Z contiguous: bool, 2025-05-07T20:32:53.1794081Z compiled: bool, 2025-05-07T20:32:53.1794317Z ) -> None: 2025-05-07T20:32:53.1794538Z torch.manual_seed(2025) 2025-05-07T20:32:53.1794781Z 2025-05-07T20:32:53.1795062Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1795419Z 2025-05-07T20:32:53.1795618Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1795910Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1796233Z x = x_sign * x_clamp 2025-05-07T20:32:53.1796481Z x0 = x[:, :D] 2025-05-07T20:32:53.1796698Z x1 = x[:, D:] 2025-05-07T20:32:53.1796912Z 2025-05-07T20:32:53.1797188Z if contiguous: 2025-05-07T20:32:53.1797423Z x0 = x0.contiguous() 2025-05-07T20:32:53.1797690Z x1 = x1.contiguous() 2025-05-07T20:32:53.1797937Z 2025-05-07T20:32:53.1798130Z if scale_ub is not None: 2025-05-07T20:32:53.1798413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1798762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1799074Z ) 2025-05-07T20:32:53.1799274Z else: 2025-05-07T20:32:53.1799492Z scale_ub_tensor = None 2025-05-07T20:32:53.1799822Z 2025-05-07T20:32:53.1800060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1800386Z op = silu_mul_quant 2025-05-07T20:32:53.1800646Z if compiled: 2025-05-07T20:32:53.1800898Z op = torch.compile(op) 2025-05-07T20:32:53.1801202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1801485Z 2025-05-07T20:32:53.1801678Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1801856Z 2025-05-07T20:32:53.1801960Z moe/activation_test.py:117: 2025-05-07T20:32:53.1802272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1802610Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1802902Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1803621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:53.1804337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.1804890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1805597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1806558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1807110Z kernel = self.compile( 2025-05-07T20:32:53.1807672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1808356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1808771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1809009Z 2025-05-07T20:32:53.1809224Z self = 2025-05-07T20:32:53.1810345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1811951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871f59a80>} 2025-05-07T20:32:53.1813639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1814903Z context = 2025-05-07T20:32:53.1815243Z 2025-05-07T20:32:53.1815428Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1816073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1816556Z module_map=module_map) 2025-05-07T20:32:53.1816929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1817292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.1817562Z E ^ 2025-05-07T20:32:53.1818091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1818564Z 2025-05-07T20:32:53.1819054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1819595Z 2025-05-07T20:32:53.1819701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1820131Z self=, 2025-05-07T20:32:53.1820546Z T=1, 2025-05-07T20:32:53.1820738Z D=7168, 2025-05-07T20:32:53.1820940Z scale_ub=1200.0, 2025-05-07T20:32:53.1821167Z contiguous=False, 2025-05-07T20:32:53.1821404Z compiled=False, 2025-05-07T20:32:53.1821619Z ) 2025-05-07T20:32:53.1822022Z self = 2025-05-07T20:32:53.1822526Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.1822808Z 2025-05-07T20:32:53.1822889Z @given( 2025-05-07T20:32:53.1823128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1823447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1823768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1824106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1824437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1824736Z ) 2025-05-07T20:32:53.1825098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1825560Z def test_silu_mul_quant( 2025-05-07T20:32:53.1825807Z self, 2025-05-07T20:32:53.1826012Z T: int, 2025-05-07T20:32:53.1826222Z D: int, 2025-05-07T20:32:53.1826448Z scale_ub: Optional[float], 2025-05-07T20:32:53.1826732Z contiguous: bool, 2025-05-07T20:32:53.1826985Z compiled: bool, 2025-05-07T20:32:53.1827211Z ) -> None: 2025-05-07T20:32:53.1827437Z torch.manual_seed(2025) 2025-05-07T20:32:53.1827696Z 2025-05-07T20:32:53.1827971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1828329Z 2025-05-07T20:32:53.1828535Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1828830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1829151Z x = x_sign * x_clamp 2025-05-07T20:32:53.1829401Z x0 = x[:, :D] 2025-05-07T20:32:53.1829618Z x1 = x[:, D:] 2025-05-07T20:32:53.1829835Z 2025-05-07T20:32:53.1830032Z if contiguous: 2025-05-07T20:32:53.1830264Z x0 = x0.contiguous() 2025-05-07T20:32:53.1830533Z x1 = x1.contiguous() 2025-05-07T20:32:53.1830780Z 2025-05-07T20:32:53.1830977Z if scale_ub is not None: 2025-05-07T20:32:53.1831253Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1831598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1831915Z ) 2025-05-07T20:32:53.1832107Z else: 2025-05-07T20:32:53.1832378Z scale_ub_tensor = None 2025-05-07T20:32:53.1832639Z 2025-05-07T20:32:53.1832874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1833202Z op = silu_mul_quant 2025-05-07T20:32:53.1833459Z if compiled: 2025-05-07T20:32:53.1833705Z op = torch.compile(op) 2025-05-07T20:32:53.1834010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1834295Z 2025-05-07T20:32:53.1834487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1834727Z 2025-05-07T20:32:53.1834829Z moe/activation_test.py:117: 2025-05-07T20:32:53.1835133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1835478Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1835766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1836484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.1837196Z 
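Every fp8 example fails at the same point: Triton's NVIDIA backend only lowers the fp8e4nv (FP8 E4M3) element type on GPUs with compute capability 8.9 or newer (Ada, Hopper), and the A10G behind linux.g5.4xlarge reports SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A minimal capability guard, as a sketch (the helper and its placement are illustrative, not the repository's actual gating):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (E4M3) only on SM 8.9+ (Ada, Hopper).
    # An A10G reports (8, 6), so this returns False on this runner.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Illustrative placement; the real tests live in moe/activation_test.py.
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class SiluMulQuantTests(unittest.TestCase):
    ...

With such a guard the runner would report a skip instead of repeating the identical compilation failure for every drawn example.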
2025-05-07T20:32:53.1819701Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:53.3583063Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError; with compiled=True the traceback only gains a torch/_dynamo/eval_frame.py:678 frame before reaching silu_mul_quant
2025-05-07T20:32:53.3617052Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:53.4561373Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
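For context on what the test exercises: silu_mul_quant fuses silu(x0) * x1 with quantization to FP8, where Triton's fp8e4nv corresponds to torch.float8_e4m3fn, and returns the quantized tensor plus a scale, with scale_ub capping the dynamic range used to derive that scale. A rough eager-mode sketch, assuming row-wise dynamic quantization (the real kernel's scaling granularity and clamping details may differ):

import torch
import torch.nn.functional as F


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1 in float32 for accuracy, then rowwise FP8 quantization.
    y = F.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)  # cap the per-row dynamic range
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    y_scale = amax / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale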
2025-05-07T20:32:53.5287859Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:53.5297738Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:53.5299918Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:53.5302102Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:53.5302444Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free
2025-05-07T20:32:53.5316323Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free
2025-05-07T20:32:53.5337597Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free
2025-05-07T20:32:53.5351408Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB with 28.44 MiB free
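These out-of-memory failures are cumulative rather than intrinsic to any single example: the A10G's 22.07 GiB is already about 21.9 to 22.0 GiB occupied when allocations as small as 40 to 56 MiB fail, because tensors from earlier Hypothesis examples are still referenced and the caching allocator holds their blocks. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting the message itself suggests, releasing cached memory between examples is a common mitigation; a sketch (the setUp placement is illustrative):

import gc

import torch


def release_cuda_memory() -> None:
    # Drop dangling Python references first, then return the caching
    # allocator's blocks to the driver so the next example starts with
    # (nearly) the whole device free.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()


# Illustrative: run before each Hypothesis example.
# def setUp(self) -> None:
#     release_cuda_memory()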
2025-05-07T20:32:53.6490781Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:53.6523649Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:53.7233181Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:53.7265940Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB with 26.44 MiB free
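Another mitigation on a 22 GiB part is to let Hypothesis discard draws that cannot fit: the largest case here, T=16384 with D=7168, needs 448 MiB for x alone (16384 x 14336 bfloat16 values), matching the failed 448.00 MiB allocation above, and the sign/abs/clamp steps materialize several temporaries of the same size. A hedged sketch using hypothesis.assume and torch.cuda.mem_get_info (the copies factor and headroom are guesses, not measured bounds):

import torch
from hypothesis import assume


def fits_in_free_memory(T: int, D: int, copies: int = 4) -> bool:
    # x is [T, 2 * D] bfloat16; the test materializes several same-sized
    # temporaries (abs, clamp, sign, product), hence the copies factor.
    free_bytes, _total = torch.cuda.mem_get_info()
    needed = copies * T * 2 * D * torch.bfloat16.itemsize
    return needed < free_bytes // 2  # keep headroom for the allocator


# Inside test_silu_mul_quant, before allocating x (illustrative):
# assume(fits_in_free_memory(T, D))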
2025-05-07T20:32:53.8086825Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117 (fp8e4nv unsupported)
2025-05-07T20:32:53.8126573Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:53.8135794Z >       x_sign = torch.sign(x)
2025-05-07T20:32:53.8137840Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8139805Z 2025-05-07T20:32:53.8139932Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:53.8140152Z 2025-05-07T20:32:53.8140263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8140699Z self=, 2025-05-07T20:32:53.8141117Z T=16384, 2025-05-07T20:32:53.8141318Z D=5120, 2025-05-07T20:32:53.8141517Z scale_ub=None, 2025-05-07T20:32:53.8141735Z contiguous=True, 2025-05-07T20:32:53.8141964Z compiled=False, 2025-05-07T20:32:53.8142176Z ) 2025-05-07T20:32:53.8884758Z self = 2025-05-07T20:32:53.8885584Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.8885981Z 2025-05-07T20:32:53.8886086Z @given( 2025-05-07T20:32:53.8886412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8886791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8887106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8887567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8887909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8888200Z ) 2025-05-07T20:32:53.8888577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8889044Z def test_silu_mul_quant( 2025-05-07T20:32:53.8889290Z self, 2025-05-07T20:32:53.8889492Z T: int, 2025-05-07T20:32:53.8889699Z D: int, 2025-05-07T20:32:53.8889926Z scale_ub: Optional[float], 2025-05-07T20:32:53.8890276Z contiguous: bool, 2025-05-07T20:32:53.8890535Z compiled: bool, 2025-05-07T20:32:53.8890768Z ) -> None: 2025-05-07T20:32:53.8890992Z torch.manual_seed(2025) 2025-05-07T20:32:53.8891253Z 2025-05-07T20:32:53.8891531Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8893850Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8895839Z 2025-05-07T20:32:53.8895966Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8896191Z 2025-05-07T20:32:53.8896304Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8896741Z self=, 2025-05-07T20:32:53.8897228Z T=4096, 2025-05-07T20:32:53.8897424Z D=5120, 2025-05-07T20:32:53.8897627Z scale_ub=None, 2025-05-07T20:32:53.8897850Z contiguous=True, 2025-05-07T20:32:53.8898112Z compiled=False, 2025-05-07T20:32:53.8898357Z ) 2025-05-07T20:32:53.8898688Z self = 2025-05-07T20:32:53.8899217Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.8899510Z 2025-05-07T20:32:53.8899593Z @given( 2025-05-07T20:32:53.8899835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8900158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8900481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8900837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8901183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8901492Z ) 2025-05-07T20:32:53.8901861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8902328Z def test_silu_mul_quant( 2025-05-07T20:32:53.8902582Z self, 2025-05-07T20:32:53.8902793Z T: int, 2025-05-07T20:32:53.8903004Z D: int, 2025-05-07T20:32:53.8903227Z scale_ub: Optional[float], 2025-05-07T20:32:53.8903516Z contiguous: bool, 2025-05-07T20:32:53.8903773Z compiled: bool, 2025-05-07T20:32:53.8904008Z ) -> None: 2025-05-07T20:32:53.8904234Z torch.manual_seed(2025) 2025-05-07T20:32:53.8904485Z 2025-05-07T20:32:53.8904763Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8907259Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8909289Z 2025-05-07T20:32:53.8909415Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8909642Z 2025-05-07T20:32:53.8909754Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8910231Z self=, 2025-05-07T20:32:53.8910697Z T=2048, 2025-05-07T20:32:53.8910897Z D=5120, 2025-05-07T20:32:53.8911111Z scale_ub=None, 2025-05-07T20:32:53.8911343Z contiguous=False, 2025-05-07T20:32:53.8911647Z compiled=False, 2025-05-07T20:32:53.8911863Z ) 2025-05-07T20:32:53.8912197Z self = 2025-05-07T20:32:53.8912725Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.8913015Z 2025-05-07T20:32:53.8913102Z @given( 2025-05-07T20:32:53.8913336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8913668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8913998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8914351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8914759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8915059Z ) 2025-05-07T20:32:53.8915423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8915879Z def test_silu_mul_quant( 2025-05-07T20:32:53.8916133Z self, 2025-05-07T20:32:53.8916336Z T: int, 2025-05-07T20:32:53.8916546Z D: int, 2025-05-07T20:32:53.8916775Z scale_ub: Optional[float], 2025-05-07T20:32:53.8917052Z contiguous: bool, 2025-05-07T20:32:53.8917299Z compiled: bool, 2025-05-07T20:32:53.8917531Z ) -> None: 2025-05-07T20:32:53.8917814Z torch.manual_seed(2025) 2025-05-07T20:32:53.8918069Z 2025-05-07T20:32:53.8918355Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8920514Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8922480Z 2025-05-07T20:32:53.8922610Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8922833Z 2025-05-07T20:32:53.8922943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8923377Z self=, 2025-05-07T20:32:53.8923801Z T=4096, 2025-05-07T20:32:53.8923988Z D=7168, 2025-05-07T20:32:53.8924185Z scale_ub=None, 2025-05-07T20:32:53.8924409Z contiguous=True, 2025-05-07T20:32:53.8924633Z compiled=True, 2025-05-07T20:32:53.8924848Z ) 2025-05-07T20:32:53.8925184Z self = 2025-05-07T20:32:53.8925697Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.8925986Z 2025-05-07T20:32:53.8926067Z @given( 2025-05-07T20:32:53.8926329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8926661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8926981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8927327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8927671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8927969Z ) 2025-05-07T20:32:53.8928385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8928849Z def test_silu_mul_quant( 2025-05-07T20:32:53.8929100Z self, 2025-05-07T20:32:53.8929348Z T: int, 2025-05-07T20:32:53.8929554Z D: int, 2025-05-07T20:32:53.8929780Z scale_ub: Optional[float], 2025-05-07T20:32:53.8930057Z contiguous: bool, 2025-05-07T20:32:53.8930304Z compiled: bool, 2025-05-07T20:32:53.8930535Z ) -> None: 2025-05-07T20:32:53.8930753Z torch.manual_seed(2025) 2025-05-07T20:32:53.8931005Z 2025-05-07T20:32:53.8931285Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8933576Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8935553Z 2025-05-07T20:32:53.8935682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8935948Z 2025-05-07T20:32:53.8936056Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8936488Z self=, 2025-05-07T20:32:53.8936913Z T=2048, 2025-05-07T20:32:53.8937101Z D=5120, 2025-05-07T20:32:53.8937298Z scale_ub=1200.0, 2025-05-07T20:32:53.8937537Z contiguous=False, 2025-05-07T20:32:53.8937765Z compiled=False, 2025-05-07T20:32:53.8937974Z ) 2025-05-07T20:32:53.8938356Z self = 2025-05-07T20:32:53.8938951Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.8939249Z 2025-05-07T20:32:53.8939329Z @given( 2025-05-07T20:32:53.8939568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8939895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8940213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8940557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8940901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8941194Z ) 2025-05-07T20:32:53.8941557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8942021Z def test_silu_mul_quant( 2025-05-07T20:32:53.8942269Z self, 2025-05-07T20:32:53.8942471Z T: int, 2025-05-07T20:32:53.8942673Z D: int, 2025-05-07T20:32:53.8942896Z scale_ub: Optional[float], 2025-05-07T20:32:53.8943178Z contiguous: bool, 2025-05-07T20:32:53.8943431Z compiled: bool, 2025-05-07T20:32:53.8943656Z ) -> None: 2025-05-07T20:32:53.8943878Z torch.manual_seed(2025) 2025-05-07T20:32:53.8944125Z 2025-05-07T20:32:53.8944413Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8946580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8948546Z 2025-05-07T20:32:53.8948668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8948898Z 2025-05-07T20:32:53.8949004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8949435Z self=, 2025-05-07T20:32:53.8949856Z T=4096, 2025-05-07T20:32:53.8950049Z D=7168, 2025-05-07T20:32:53.8950294Z scale_ub=1200.0, 2025-05-07T20:32:53.8950528Z contiguous=True, 2025-05-07T20:32:53.8950755Z compiled=False, 2025-05-07T20:32:53.8950967Z ) 2025-05-07T20:32:54.0019900Z self = 2025-05-07T20:32:54.0020702Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.0021088Z 2025-05-07T20:32:54.0021195Z @given( 2025-05-07T20:32:54.0021464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0021896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0022216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0022555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0022906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0023202Z ) 2025-05-07T20:32:54.0023560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0024029Z def test_silu_mul_quant( 2025-05-07T20:32:54.0024286Z self, 2025-05-07T20:32:54.0024489Z T: int, 2025-05-07T20:32:54.0024703Z D: int, 2025-05-07T20:32:54.0025004Z scale_ub: Optional[float], 2025-05-07T20:32:54.0025287Z contiguous: bool, 2025-05-07T20:32:54.0025539Z compiled: bool, 2025-05-07T20:32:54.0025776Z ) -> None: 2025-05-07T20:32:54.0025999Z torch.manual_seed(2025) 2025-05-07T20:32:54.0026260Z 2025-05-07T20:32:54.0026558Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0028782Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0030801Z 2025-05-07T20:32:54.0030936Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0031158Z 2025-05-07T20:32:54.0031266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0031701Z self=, 2025-05-07T20:32:54.0032124Z T=16384, 2025-05-07T20:32:54.0032330Z D=7168, 2025-05-07T20:32:54.0032532Z scale_ub=None, 2025-05-07T20:32:54.0032761Z contiguous=False, 2025-05-07T20:32:54.0032991Z compiled=True, 2025-05-07T20:32:54.0033202Z ) 2025-05-07T20:32:54.0033542Z self = 2025-05-07T20:32:54.0034076Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.0034367Z 2025-05-07T20:32:54.0034452Z @given( 2025-05-07T20:32:54.0034699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0035029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0035348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0035694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0036044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0036341Z ) 2025-05-07T20:32:54.0036707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0037182Z def test_silu_mul_quant( 2025-05-07T20:32:54.0037439Z self, 2025-05-07T20:32:54.0037640Z T: int, 2025-05-07T20:32:54.0037851Z D: int, 2025-05-07T20:32:54.0038081Z scale_ub: Optional[float], 2025-05-07T20:32:54.0038393Z contiguous: bool, 2025-05-07T20:32:54.0038673Z compiled: bool, 2025-05-07T20:32:54.0038904Z ) -> None: 2025-05-07T20:32:54.0039127Z torch.manual_seed(2025) 2025-05-07T20:32:54.0039454Z 2025-05-07T20:32:54.0039739Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0041884Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0043880Z 2025-05-07T20:32:54.0044008Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0044239Z 2025-05-07T20:32:54.0044345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0044779Z self=, 2025-05-07T20:32:54.0045214Z T=4096, 2025-05-07T20:32:54.0045408Z D=7168, 2025-05-07T20:32:54.0045610Z scale_ub=None, 2025-05-07T20:32:54.0045837Z contiguous=True, 2025-05-07T20:32:54.0046114Z compiled=False, 2025-05-07T20:32:54.0046342Z ) 2025-05-07T20:32:54.0046680Z self = 2025-05-07T20:32:54.0047195Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.0047489Z 2025-05-07T20:32:54.0047575Z @given( 2025-05-07T20:32:54.0047820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0048145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0048469Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0048863Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0049213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0049509Z ) 2025-05-07T20:32:54.0049880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0050345Z def test_silu_mul_quant( 2025-05-07T20:32:54.0050594Z self, 2025-05-07T20:32:54.0050806Z T: int, 2025-05-07T20:32:54.0051016Z D: int, 2025-05-07T20:32:54.0051242Z scale_ub: Optional[float], 2025-05-07T20:32:54.0051525Z contiguous: bool, 2025-05-07T20:32:54.0051777Z compiled: bool, 2025-05-07T20:32:54.0052065Z ) -> None: 2025-05-07T20:32:54.0052294Z torch.manual_seed(2025) 2025-05-07T20:32:54.0052551Z 2025-05-07T20:32:54.0052829Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0054991Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0056946Z 2025-05-07T20:32:54.0057071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0057301Z 2025-05-07T20:32:54.0057409Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0057848Z self=, 2025-05-07T20:32:54.0058318Z T=16384, 2025-05-07T20:32:54.0058522Z D=7168, 2025-05-07T20:32:54.0058728Z scale_ub=None, 2025-05-07T20:32:54.0058949Z contiguous=True, 2025-05-07T20:32:54.0059190Z compiled=False, 2025-05-07T20:32:54.0059411Z ) 2025-05-07T20:32:54.0059740Z self = 2025-05-07T20:32:54.0060264Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.0060614Z 2025-05-07T20:32:54.0060698Z @given( 2025-05-07T20:32:54.0060942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0061284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0061603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0061948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0062301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0062599Z ) 2025-05-07T20:32:54.0063014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0063479Z def test_silu_mul_quant( 2025-05-07T20:32:54.0063726Z self, 2025-05-07T20:32:54.0063936Z T: int, 2025-05-07T20:32:54.0064156Z D: int, 2025-05-07T20:32:54.0064384Z scale_ub: Optional[float], 2025-05-07T20:32:54.0064681Z contiguous: bool, 2025-05-07T20:32:54.0064943Z compiled: bool, 2025-05-07T20:32:54.0071664Z ) -> None: 2025-05-07T20:32:54.0071902Z torch.manual_seed(2025) 2025-05-07T20:32:54.0072156Z 2025-05-07T20:32:54.0072439Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0074647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.0076646Z 2025-05-07T20:32:54.0076778Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.0077002Z 2025-05-07T20:32:54.0077108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0077541Z self=, 2025-05-07T20:32:54.0077964Z T=16384, 2025-05-07T20:32:54.0078162Z D=7168, 2025-05-07T20:32:54.0078365Z scale_ub=1200.0, 2025-05-07T20:32:54.0078629Z contiguous=True, 2025-05-07T20:32:54.0078869Z compiled=False, 2025-05-07T20:32:54.0079081Z ) 2025-05-07T20:32:54.0079410Z self = 2025-05-07T20:32:54.0079926Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.0080218Z 2025-05-07T20:32:54.0080300Z @given( 2025-05-07T20:32:54.0080538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0080860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0081179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0081523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0081863Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0082155Z ) 2025-05-07T20:32:54.0082519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0082977Z def test_silu_mul_quant( 2025-05-07T20:32:54.0083223Z self, 2025-05-07T20:32:54.0083430Z T: int, 2025-05-07T20:32:54.0083638Z D: int, 2025-05-07T20:32:54.0083860Z scale_ub: Optional[float], 2025-05-07T20:32:54.0084135Z contiguous: bool, 2025-05-07T20:32:54.0084380Z compiled: bool, 2025-05-07T20:32:54.0084611Z ) -> None: 2025-05-07T20:32:54.0084829Z torch.manual_seed(2025) 2025-05-07T20:32:54.0085082Z 2025-05-07T20:32:54.0085362Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0088185Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
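The OutOfMemoryError sizes track the first tensor the test allocates: x has shape [T, 2 * D] in bfloat16, i.e. T * 2D * 2 bytes. A quick illustrative check (the helper name is invented here, not part of the test):

    # Expected allocation for x = torch.randn([T, 2 * D], dtype=torch.bfloat16), in MiB.
    # bfloat16 occupies 2 bytes per element.
    def randn_alloc_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert randn_alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert randn_alloc_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert randn_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"

The 20.00 MiB requests for the T=128 examples exceed the 3.50 MiB tensor itself, consistent with the caching allocator rounding mid-sized requests up to a larger block size (an inference, not something the log states). Note also that only 26.44 MiB of the 22.07 GiB card was free before the first example here, so the message's PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion is unlikely to help by itself: the device reached this test already nearly full.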
Each CompilationError in the list fails the same way once Triton lowers the kernel; the traceback is identical apart from the entry point (the compiled=True case additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn, return fn(*args, **kwargs)):

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:54.4301269Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True

[test body as above]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |              ^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb33e99c60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
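The repeated ValueError is Triton's hardware gate: fp8e4nv is Triton's name for float8_e4m3fn, whose conversions require compute capability 8.9 or newer (Ada/Hopper), while the A10G in a g5.4xlarge runner is SM 8.6 and only offers fp8e5 and fp8e4b15. A hedged sketch of a capability guard such a test could use; the helper name and threshold are illustrative, not FBGEMM's actual gating:

```python
# Illustrative guard: skip FP8 E4M3 (Triton fp8e4nv) work on GPUs below SM 8.9.
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # A10G (g5.*) reports (8, 6) -> False; L4/L40S (8, 9) and H100 (9, 0) -> True.
    return (major, minor) >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv supported:", supports_fp8e4nv())
```

On this runner the guard would report False, which matches the CompilationError above.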
Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:54.4675579Z 2025-05-07T20:32:54.4675676Z moe/activation_test.py:126: 2025-05-07T20:32:54.4675810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4675925Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.4676062Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4676645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.4676796Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.4677171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4677408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4677792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.4678055Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.4678452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.4678625Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.4678987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.4679068Z fn() 2025-05-07T20:32:54.4679483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.4679575Z self.fn.run( 2025-05-07T20:32:54.4679928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4680020Z kernel = self.compile( 2025-05-07T20:32:54.4680423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4680601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4680739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4680747Z 2025-05-07T20:32:54.4680956Z self = 2025-05-07T20:32:54.4681768Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4682369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0dd932e0>} 2025-05-07T20:32:54.4683150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4683354Z context = 2025-05-07T20:32:54.4683359Z 2025-05-07T20:32:54.4683526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4683839Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4683951Z module_map=module_map) 2025-05-07T20:32:54.4684118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4684225Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.4684302Z E ^ 2025-05-07T20:32:54.4684671Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4684676Z 2025-05-07T20:32:54.4685157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
[... Trying example T=2048, T=128, and T=4096 (all with D=5120, scale_ub=None, contiguous=True, compiled=True) fail with the same CompilationError in _kernel_quantize_fp8_row; the duplicated source listings and tracebacks are omitted, each ending with: ...]
E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4767848Z 2025-05-07T20:32:54.4768280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4768816Z 2025-05-07T20:32:54.4768927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4769350Z self=, 2025-05-07T20:32:54.4769766Z T=16384, 2025-05-07T20:32:54.4769962Z D=5120, 2025-05-07T20:32:54.4770158Z scale_ub=None, 2025-05-07T20:32:54.4770413Z contiguous=True, 2025-05-07T20:32:54.4770642Z compiled=True, 2025-05-07T20:32:54.4770848Z ) 2025-05-07T20:32:54.4771169Z self = 2025-05-07T20:32:54.4771677Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.4772052Z 2025-05-07T20:32:54.4772138Z @given( 2025-05-07T20:32:54.4772368Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4772696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4773006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4773419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4773793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4774111Z ) 2025-05-07T20:32:54.4774515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4775025Z def test_silu_mul_quant( 2025-05-07T20:32:54.4775285Z self, 2025-05-07T20:32:54.4775492Z T: int, 2025-05-07T20:32:54.4775708Z D: int, 2025-05-07T20:32:54.4775940Z scale_ub: Optional[float], 2025-05-07T20:32:54.4776232Z contiguous: bool, 2025-05-07T20:32:54.4776490Z compiled: bool, 2025-05-07T20:32:54.4776729Z ) -> None: 2025-05-07T20:32:54.4776951Z torch.manual_seed(2025) 2025-05-07T20:32:54.4777217Z 2025-05-07T20:32:54.4777513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4777893Z 2025-05-07T20:32:54.4778100Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4778421Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4778781Z x = x_sign * x_clamp 2025-05-07T20:32:54.4779072Z x0 = x[:, :D] 2025-05-07T20:32:54.4779306Z x1 = x[:, D:] 2025-05-07T20:32:54.4779526Z 2025-05-07T20:32:54.4779724Z if contiguous: 2025-05-07T20:32:54.4779973Z x0 = x0.contiguous() 2025-05-07T20:32:54.4780255Z x1 = x1.contiguous() 2025-05-07T20:32:54.4780512Z 2025-05-07T20:32:54.4780713Z if scale_ub is not None: 2025-05-07T20:32:54.4781008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4781372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4781716Z ) 2025-05-07T20:32:54.4781920Z else: 2025-05-07T20:32:54.4782137Z scale_ub_tensor = None 2025-05-07T20:32:54.4782412Z 2025-05-07T20:32:54.4782657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4783002Z op = silu_mul_quant 2025-05-07T20:32:54.4783275Z if compiled: 2025-05-07T20:32:54.4783541Z op = torch.compile(op) 2025-05-07T20:32:54.4783860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4784161Z 2025-05-07T20:32:54.4784418Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.4784726Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.4785052Z 2025-05-07T20:32:54.4785307Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4785682Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.4786001Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.4786350Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.4786807Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4787117Z 2025-05-07T20:32:54.4787321Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.4787520Z 2025-05-07T20:32:54.4787626Z moe/activation_test.py:126: 2025-05-07T20:32:54.4787926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4788269Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.4788602Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4789516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.4790289Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.4790853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4791562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4792281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.4793024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.4793824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.4794489Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.4795103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.4795639Z fn() 2025-05-07T20:32:54.4796174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.4796773Z self.fn.run( 2025-05-07T20:32:54.4797250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4797802Z kernel = self.compile( 2025-05-07T20:32:54.4798357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4799077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4799489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4799732Z 2025-05-07T20:32:54.4799947Z self = 2025-05-07T20:32:54.4801078Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4802509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ccb9620>} 2025-05-07T20:32:54.4803907Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4804971Z context = 2025-05-07T20:32:54.4805268Z 2025-05-07T20:32:54.4805442Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4806028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4806793Z module_map=module_map) 2025-05-07T20:32:54.4807177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4807540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.4807804Z E ^ 2025-05-07T20:32:54.4808285Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4808918Z 2025-05-07T20:32:54.4809355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4809889Z 2025-05-07T20:32:54.4810005Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4810429Z self=, 2025-05-07T20:32:54.4810847Z T=1, 2025-05-07T20:32:54.4811035Z D=5120, 2025-05-07T20:32:54.4811223Z scale_ub=1200.0, 2025-05-07T20:32:54.4811451Z contiguous=True, 2025-05-07T20:32:54.4811675Z compiled=True, 2025-05-07T20:32:54.4811957Z ) 2025-05-07T20:32:54.4812361Z self = 2025-05-07T20:32:54.4812866Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.4813135Z 2025-05-07T20:32:54.4813216Z @given( 2025-05-07T20:32:54.4813443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4813762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4814077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4814409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4814747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4815119Z ) 2025-05-07T20:32:54.4815472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4815924Z def test_silu_mul_quant( 2025-05-07T20:32:54.4816172Z self, 2025-05-07T20:32:54.4816362Z T: int, 2025-05-07T20:32:54.4816561Z D: int, 2025-05-07T20:32:54.4816782Z scale_ub: Optional[float], 2025-05-07T20:32:54.4817055Z contiguous: bool, 2025-05-07T20:32:54.4817301Z compiled: bool, 2025-05-07T20:32:54.4817526Z ) -> None: 2025-05-07T20:32:54.4817745Z torch.manual_seed(2025) 2025-05-07T20:32:54.4817987Z 2025-05-07T20:32:54.4818267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4818619Z 2025-05-07T20:32:54.4818806Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4819101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4819418Z x = x_sign * x_clamp 2025-05-07T20:32:54.4819661Z x0 = x[:, :D] 2025-05-07T20:32:54.4819878Z x1 = x[:, D:] 2025-05-07T20:32:54.4820089Z 2025-05-07T20:32:54.4820272Z if contiguous: 2025-05-07T20:32:54.4820508Z x0 = x0.contiguous() 2025-05-07T20:32:54.4820771Z x1 = x1.contiguous() 2025-05-07T20:32:54.4821010Z 2025-05-07T20:32:54.4821202Z if scale_ub is not None: 2025-05-07T20:32:54.4821481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4821815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4822127Z ) 2025-05-07T20:32:54.4822323Z else: 2025-05-07T20:32:54.4822534Z scale_ub_tensor = None 2025-05-07T20:32:54.4822781Z 2025-05-07T20:32:54.4823015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4823338Z op = silu_mul_quant 2025-05-07T20:32:54.4823585Z if compiled: 2025-05-07T20:32:54.4823834Z op = torch.compile(op) 2025-05-07T20:32:54.4824139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4824410Z 2025-05-07T20:32:54.4824604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4824771Z 2025-05-07T20:32:54.4824876Z moe/activation_test.py:117: 2025-05-07T20:32:54.4825244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4825585Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4825874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4826448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.4827019Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.4827694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4828445Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4828990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4829694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4830376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4830921Z kernel = self.compile( 2025-05-07T20:32:54.4831511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4832197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4832605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4832840Z 2025-05-07T20:32:54.4833062Z self = 2025-05-07T20:32:54.4834177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4835683Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c844720>} 2025-05-07T20:32:54.4837077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4838143Z context = 2025-05-07T20:32:54.4838440Z 2025-05-07T20:32:54.4838611Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4839207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4839690Z module_map=module_map) 2025-05-07T20:32:54.4840067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4840429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4840694Z E ^ 2025-05-07T20:32:54.4841179Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4841644Z 2025-05-07T20:32:54.4842075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
[... Trying example T=1 (D=5120, scale_ub=None, contiguous=False, compiled=True) fails with the same CompilationError in _kernel_quantize_fp8_row via ref_fn; the duplicated source listing and traceback are omitted, ending with: ...]
E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4887634Z 2025-05-07T20:32:54.4888065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4888602Z 2025-05-07T20:32:54.4888705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4889124Z self=, 2025-05-07T20:32:54.4889534Z T=1, 2025-05-07T20:32:54.4889710Z D=5120, 2025-05-07T20:32:54.4889907Z scale_ub=None, 2025-05-07T20:32:54.4890122Z contiguous=True, 2025-05-07T20:32:54.4890340Z compiled=False, 2025-05-07T20:32:54.4890545Z ) 2025-05-07T20:32:54.4890881Z self = 2025-05-07T20:32:54.4891372Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.4891646Z 2025-05-07T20:32:54.4891723Z @given( 2025-05-07T20:32:54.4892086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4892400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4892711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4893096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4893440Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4893727Z ) 2025-05-07T20:32:54.4894086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4894539Z def test_silu_mul_quant( 2025-05-07T20:32:54.4894779Z self, 2025-05-07T20:32:54.4894972Z T: int, 2025-05-07T20:32:54.4895168Z D: int, 2025-05-07T20:32:54.4895383Z scale_ub: Optional[float], 2025-05-07T20:32:54.4895705Z contiguous: bool, 2025-05-07T20:32:54.4895947Z compiled: bool, 2025-05-07T20:32:54.4896166Z ) -> None: 2025-05-07T20:32:54.4896383Z torch.manual_seed(2025) 2025-05-07T20:32:54.4896636Z 2025-05-07T20:32:54.4896905Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4897254Z 2025-05-07T20:32:54.4897446Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4897736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4898054Z x = x_sign * x_clamp 2025-05-07T20:32:54.4898295Z x0 = x[:, :D] 2025-05-07T20:32:54.4898634Z x1 = x[:, D:] 2025-05-07T20:32:54.4898845Z 2025-05-07T20:32:54.4899031Z if contiguous: 2025-05-07T20:32:54.4899263Z x0 = x0.contiguous() 2025-05-07T20:32:54.4899515Z x1 = x1.contiguous() 2025-05-07T20:32:54.4899766Z 2025-05-07T20:32:54.4899958Z if scale_ub is not None: 2025-05-07T20:32:54.4900230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4900572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4900887Z ) 2025-05-07T20:32:54.4901075Z else: 2025-05-07T20:32:54.4901340Z scale_ub_tensor = None 2025-05-07T20:32:54.4901595Z 2025-05-07T20:32:54.4901823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4902146Z op = silu_mul_quant 2025-05-07T20:32:54.4902409Z if compiled: 2025-05-07T20:32:54.4902659Z op = torch.compile(op) 2025-05-07T20:32:54.4902960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4903241Z 2025-05-07T20:32:54.4903429Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4903602Z 2025-05-07T20:32:54.4903701Z moe/activation_test.py:117: 2025-05-07T20:32:54.4904007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4904346Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4904628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4905341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4906053Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4906942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4907654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4908339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4908935Z kernel = self.compile( 2025-05-07T20:32:54.4909484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4910156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4910564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4910799Z 2025-05-07T20:32:54.4911017Z self = 2025-05-07T20:32:54.4912129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4913715Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c8476a0>} 2025-05-07T20:32:54.4915112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4916171Z context = 2025-05-07T20:32:54.4916540Z 2025-05-07T20:32:54.4916709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4917246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4917725Z module_map=module_map) 2025-05-07T20:32:54.4918094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4918445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4918707Z E ^ 2025-05-07T20:32:54.4919333Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4919801Z 2025-05-07T20:32:54.4920235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4920765Z 2025-05-07T20:32:54.4920869Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4921293Z self=, 2025-05-07T20:32:54.4921708Z T=128, 2025-05-07T20:32:54.4921892Z D=5120, 2025-05-07T20:32:54.4922088Z scale_ub=None, 2025-05-07T20:32:54.4922304Z contiguous=False, 2025-05-07T20:32:54.4922599Z compiled=True, 2025-05-07T20:32:54.4922804Z ) 2025-05-07T20:32:54.4923133Z self = 2025-05-07T20:32:54.4923635Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.4923918Z 2025-05-07T20:32:54.4923994Z @given( 2025-05-07T20:32:54.4924226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4924549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4924854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4925189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4925526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4925809Z ) 2025-05-07T20:32:54.4926174Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4926625Z def test_silu_mul_quant( 2025-05-07T20:32:54.4926865Z self, 2025-05-07T20:32:54.4927063Z T: int, 2025-05-07T20:32:54.4927265Z D: int, 2025-05-07T20:32:54.4927479Z scale_ub: Optional[float], 2025-05-07T20:32:54.4927754Z contiguous: bool, 2025-05-07T20:32:54.4927998Z compiled: bool, 2025-05-07T20:32:54.4928225Z ) -> None: 2025-05-07T20:32:54.4928437Z torch.manual_seed(2025) 2025-05-07T20:32:54.4928681Z 2025-05-07T20:32:54.4928956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4929296Z 2025-05-07T20:32:54.4929488Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4929781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4930088Z x = x_sign * x_clamp 2025-05-07T20:32:54.4930327Z x0 = x[:, :D] 2025-05-07T20:32:54.4930546Z x1 = x[:, D:] 2025-05-07T20:32:54.4930747Z 2025-05-07T20:32:54.4930932Z if contiguous: 2025-05-07T20:32:54.4931163Z x0 = x0.contiguous() 2025-05-07T20:32:54.4931415Z x1 = x1.contiguous() 2025-05-07T20:32:54.4931658Z 2025-05-07T20:32:54.4931925Z if scale_ub is not None: 2025-05-07T20:32:54.4932209Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4932545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4932911Z ) 2025-05-07T20:32:54.4933107Z else: 2025-05-07T20:32:54.4933322Z scale_ub_tensor = None 2025-05-07T20:32:54.4933568Z 2025-05-07T20:32:54.4933805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4934124Z op = silu_mul_quant 2025-05-07T20:32:54.4934370Z if compiled: 2025-05-07T20:32:54.4934619Z op = torch.compile(op) 2025-05-07T20:32:54.4934915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4935233Z 2025-05-07T20:32:54.4935428Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4935592Z 2025-05-07T20:32:54.4935699Z moe/activation_test.py:117: 2025-05-07T20:32:54.4935992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4936335Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4936618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4937192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.4937762Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[... make_ir locals identical to the first traceback above ...]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
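Every traceback in this run bottoms out in the same ValueError: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA hardware only accepts from compute capability 8.9 (Ada/Hopper) onward; earlier parts expose only fp8e4b15 and fp8e5, exactly as the message lists. Below is a minimal sketch, assuming the job ran on a pre-8.9 device such as an A10G (capability 8.6), of a guard that would skip these cases instead of failing them. The helper name, skip message, and the (8, 9) threshold are illustrative assumptions, not code from the FBGEMM test suite:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton compiles fp8e4nv (torch.float8_e4m3fn)
        # kernels only on NVIDIA devices with compute capability >= (8, 9);
        # a device reporting (8, 6) raises the ValueError seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant, this would skip rather than fail:
    skip_if_no_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv needs compute capability >= 8.9; only fp8e4b15/fp8e5 here",
    )

With such a decorator in place, the Hypothesis retries below would never run: unittest marks the whole test as skipped before the first example is drawn.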
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[... the following retries repeat the test source and the silu_mul_quant
traceback verbatim; each fails with the same CompilationError,
ValueError("type fp8e4nv not supported in this architecture. ...") ...]

Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

[... the next example fails one step later: fn() returns, and the error is
raised from the reference path instead ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [... @given/@settings decorators and test setup as above ...]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[... make_ir locals as above, but num_stages=2 ...]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
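This example is the outlier in the run: fn() returned, and the same fp8e4nv error instead surfaced while the autotuner benchmarked _kernel_quantize_fp8_row for the reference path. For orientation, here is a rough pure-PyTorch sketch of the rowwise quantization contract the test relies on (it dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]); the FP8_MAX constant, the epsilon guard, and the function name are assumptions for illustration, not FBGEMM's actual triton_quantize_fp8_row kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Each row is scaled by its own absolute maximum so the largest
        # entry maps to the FP8 representable limit.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Mirror the optional scale_ub_tensor cap passed by the test.
            row_max = torch.minimum(row_max, scale_ub)
        # y_scale is the per-row dequantization multiplier: y ~ y_fp8 * y_scale.
        y_scale = row_max.clamp_min(1e-12) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Whether the downstream comparison checks dequantized values or raw FP8 payloads is not visible in this excerpt; the sketch only fixes the scale convention that the visible code depends on.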
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5033590Z 2025-05-07T20:32:54.5034015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5034022Z 2025-05-07T20:32:54.5034131Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5034402Z self=, 2025-05-07T20:32:54.5034480Z T=1, 2025-05-07T20:32:54.5034561Z D=5120, 2025-05-07T20:32:54.5034645Z scale_ub=1200.0, 2025-05-07T20:32:54.5034733Z contiguous=False, 2025-05-07T20:32:54.5034825Z compiled=True, 2025-05-07T20:32:54.5034897Z ) 2025-05-07T20:32:54.5035128Z self = 2025-05-07T20:32:54.5035297Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5035302Z 2025-05-07T20:32:54.5035420Z @given( 2025-05-07T20:32:54.5035547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5035645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5035759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5035885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5035998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5036071Z ) 2025-05-07T20:32:54.5036330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5036423Z def test_silu_mul_quant( 2025-05-07T20:32:54.5036505Z self, 2025-05-07T20:32:54.5036622Z T: int, 2025-05-07T20:32:54.5036700Z D: int, 2025-05-07T20:32:54.5036802Z scale_ub: Optional[float], 2025-05-07T20:32:54.5036892Z contiguous: bool, 2025-05-07T20:32:54.5036977Z compiled: bool, 2025-05-07T20:32:54.5037058Z ) -> None: 2025-05-07T20:32:54.5037152Z torch.manual_seed(2025) 2025-05-07T20:32:54.5037225Z 2025-05-07T20:32:54.5037402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5037475Z 2025-05-07T20:32:54.5037566Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5037740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5037828Z x = x_sign * x_clamp 2025-05-07T20:32:54.5037915Z x0 = x[:, :D] 2025-05-07T20:32:54.5037994Z x1 = x[:, D:] 2025-05-07T20:32:54.5038068Z 2025-05-07T20:32:54.5038158Z if contiguous: 2025-05-07T20:32:54.5038249Z x0 = x0.contiguous() 2025-05-07T20:32:54.5038340Z x1 = x1.contiguous() 2025-05-07T20:32:54.5038418Z 2025-05-07T20:32:54.5038509Z if scale_ub is not None: 2025-05-07T20:32:54.5038614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5038756Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5038831Z ) 2025-05-07T20:32:54.5038917Z else: 2025-05-07T20:32:54.5039034Z scale_ub_tensor = None 2025-05-07T20:32:54.5039115Z 2025-05-07T20:32:54.5039258Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5039369Z op = silu_mul_quant 2025-05-07T20:32:54.5039459Z if compiled: 2025-05-07T20:32:54.5039566Z op = torch.compile(op) 2025-05-07T20:32:54.5039674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5039745Z 2025-05-07T20:32:54.5039845Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5039849Z 2025-05-07T20:32:54.5039949Z moe/activation_test.py:117: 2025-05-07T20:32:54.5040085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5040194Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5040295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5040670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5040774Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5041280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5041390Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5041756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5042034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5042393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5042497Z kernel = self.compile( 2025-05-07T20:32:54.5042900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5043079Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5043210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5043256Z 2025-05-07T20:32:54.5043473Z self = 2025-05-07T20:32:54.5044277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5044807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92de40>} 2025-05-07T20:32:54.5045619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5045815Z context = 2025-05-07T20:32:54.5045822Z 2025-05-07T20:32:54.5045996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5046268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5046424Z module_map=module_map) 2025-05-07T20:32:54.5046587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5046686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5046774Z E ^ 2025-05-07T20:32:54.5047139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5047146Z 2025-05-07T20:32:54.5047576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5047580Z 2025-05-07T20:32:54.5047684Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5047914Z self=, 2025-05-07T20:32:54.5047999Z T=1, 2025-05-07T20:32:54.5048076Z D=5120, 2025-05-07T20:32:54.5048160Z scale_ub=1200.0, 2025-05-07T20:32:54.5048252Z contiguous=False, 2025-05-07T20:32:54.5048336Z compiled=False, 2025-05-07T20:32:54.5048410Z ) 2025-05-07T20:32:54.5048639Z self = 2025-05-07T20:32:54.5048810Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5048816Z 2025-05-07T20:32:54.5048899Z @given( 2025-05-07T20:32:54.5049020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5049124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5049244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5049362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5049480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5049559Z ) 2025-05-07T20:32:54.5049813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5049905Z def test_silu_mul_quant( 2025-05-07T20:32:54.5049989Z self, 2025-05-07T20:32:54.5050066Z T: int, 2025-05-07T20:32:54.5050149Z D: int, 2025-05-07T20:32:54.5050247Z scale_ub: Optional[float], 2025-05-07T20:32:54.5050337Z contiguous: bool, 2025-05-07T20:32:54.5050430Z compiled: bool, 2025-05-07T20:32:54.5050507Z ) -> None: 2025-05-07T20:32:54.5050652Z torch.manual_seed(2025) 2025-05-07T20:32:54.5050735Z 2025-05-07T20:32:54.5050908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5050986Z 2025-05-07T20:32:54.5051084Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5051209Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5051297Z x = x_sign * x_clamp 2025-05-07T20:32:54.5051384Z x0 = x[:, :D] 2025-05-07T20:32:54.5051463Z x1 = x[:, D:] 2025-05-07T20:32:54.5051578Z 2025-05-07T20:32:54.5051670Z if contiguous: 2025-05-07T20:32:54.5051761Z x0 = x0.contiguous() 2025-05-07T20:32:54.5051957Z x1 = x1.contiguous() 2025-05-07T20:32:54.5052033Z 2025-05-07T20:32:54.5052126Z if scale_ub is not None: 2025-05-07T20:32:54.5052238Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5052373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5052452Z ) 2025-05-07T20:32:54.5052536Z else: 2025-05-07T20:32:54.5052630Z scale_ub_tensor = None 2025-05-07T20:32:54.5052702Z 2025-05-07T20:32:54.5052887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5052979Z op = silu_mul_quant 2025-05-07T20:32:54.5053066Z if compiled: 2025-05-07T20:32:54.5053172Z op = torch.compile(op) 2025-05-07T20:32:54.5053281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5053361Z 2025-05-07T20:32:54.5053452Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5053457Z 2025-05-07T20:32:54.5053555Z moe/activation_test.py:117: 2025-05-07T20:32:54.5053693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5053842Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5053942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5054470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5054568Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5054947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5055174Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5055526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5055630Z kernel = self.compile( 2025-05-07T20:32:54.5056024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5056203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5056342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5056346Z 2025-05-07T20:32:54.5056559Z self = 2025-05-07T20:32:54.5057376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5057897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0d92eac0>} 2025-05-07T20:32:54.5058725Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5058924Z context = 2025-05-07T20:32:54.5058928Z 2025-05-07T20:32:54.5059098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5059421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5059531Z module_map=module_map) 2025-05-07T20:32:54.5059703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5059802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5059879Z E ^ 2025-05-07T20:32:54.5060250Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5060399Z 2025-05-07T20:32:54.5060826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5060831Z 2025-05-07T20:32:54.5060934Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5061175Z self=, 2025-05-07T20:32:54.5061253Z T=16384, 2025-05-07T20:32:54.5061335Z D=5120, 2025-05-07T20:32:54.5061421Z scale_ub=1200.0, 2025-05-07T20:32:54.5061505Z contiguous=False, 2025-05-07T20:32:54.5061594Z compiled=True, 2025-05-07T20:32:54.5061666Z ) 2025-05-07T20:32:54.5061967Z self = 2025-05-07T20:32:54.5062160Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5062164Z 2025-05-07T20:32:54.5062242Z @given( 2025-05-07T20:32:54.5062362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5062471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5062585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5062710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5062821Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5062936Z ) 2025-05-07T20:32:54.5063194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5063286Z def test_silu_mul_quant( 2025-05-07T20:32:54.5063366Z self, 2025-05-07T20:32:54.5063465Z T: int, 2025-05-07T20:32:54.5063543Z D: int, 2025-05-07T20:32:54.5063643Z scale_ub: Optional[float], 2025-05-07T20:32:54.5063739Z contiguous: bool, 2025-05-07T20:32:54.5063826Z compiled: bool, 2025-05-07T20:32:54.5063905Z ) -> None: 2025-05-07T20:32:54.5064007Z torch.manual_seed(2025) 2025-05-07T20:32:54.5064080Z 2025-05-07T20:32:54.5064253Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5064336Z 2025-05-07T20:32:54.5064428Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5064553Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5064648Z x = x_sign * x_clamp 2025-05-07T20:32:54.5064732Z x0 = x[:, :D] 2025-05-07T20:32:54.5064813Z x1 = x[:, D:] 2025-05-07T20:32:54.5064891Z 2025-05-07T20:32:54.5064973Z if contiguous: 2025-05-07T20:32:54.5065074Z x0 = x0.contiguous() 2025-05-07T20:32:54.5065165Z x1 = x1.contiguous() 2025-05-07T20:32:54.5065238Z 2025-05-07T20:32:54.5065337Z if scale_ub is not None: 2025-05-07T20:32:54.5065450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5065588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5065672Z ) 2025-05-07T20:32:54.5065748Z else: 2025-05-07T20:32:54.5065842Z scale_ub_tensor = None 2025-05-07T20:32:54.5065923Z 2025-05-07T20:32:54.5066060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5066150Z op = silu_mul_quant 2025-05-07T20:32:54.5066245Z if compiled: 2025-05-07T20:32:54.5066344Z op = torch.compile(op) 2025-05-07T20:32:54.5066462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5066534Z 2025-05-07T20:32:54.5066626Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5066631Z 2025-05-07T20:32:54.5066786Z moe/activation_test.py:117: 2025-05-07T20:32:54.5066919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5067020Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5067132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5067509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5067601Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c668180>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = 
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c668ea0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
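Every failure above has the same root cause: the Triton kernel behind silu_mul_quant materializes its output as fp8e4nv (PyTorch's float8_e4m3fn), and Triton only lowers that encoding on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older architectures only the fp8e4b15 and fp8e5 encodings exist, which is exactly what the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware (the helper name supports_fp8e4nv and the decorator placement are illustrative assumptions, not the actual FBGEMM test code):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+;
        # earlier GPUs expose just fp8e4b15 and fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class SiluMulQuantTests(unittest.TestCase):
        ...  # test_silu_mul_quant as echoed above

With such a guard the runner would report the parametrized cases as skipped instead of re-raising the identical CompilationError for every example Hypothesis draws.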
The identical test body and CompilationError traceback repeated for each further example Hypothesis tried:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
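For reference, the unsupported conversion is reproducible outside the test suite with a few lines of Triton (a hypothetical standalone repro, not taken from this log):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        # Storing a bf16 value as fp8e4nv trips the same ValueError at
        # compile time on GPUs below SM 8.9.
        offs = tl.arange(0, 16)
        x = tl.load(x_ptr + offs)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))

    x = torch.randn(16, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # raises triton.compiler.errors.CompilationError on SM < 8.9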
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5229232Z 2025-05-07T20:32:54.5229669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5229674Z 2025-05-07T20:32:54.5229780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5230022Z self=, 2025-05-07T20:32:54.5230100Z T=2048, 2025-05-07T20:32:54.5230181Z D=7168, 2025-05-07T20:32:54.5230272Z scale_ub=None, 2025-05-07T20:32:54.5230360Z contiguous=False, 2025-05-07T20:32:54.5230446Z compiled=True, 2025-05-07T20:32:54.5230525Z ) 2025-05-07T20:32:54.5230752Z self = 2025-05-07T20:32:54.5230931Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5230938Z 2025-05-07T20:32:54.5231025Z @given( 2025-05-07T20:32:54.5231146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5231252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5231369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5231487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5231610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5231683Z ) 2025-05-07T20:32:54.5231934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5232035Z def test_silu_mul_quant( 2025-05-07T20:32:54.5232111Z self, 2025-05-07T20:32:54.5232187Z T: int, 2025-05-07T20:32:54.5232270Z D: int, 2025-05-07T20:32:54.5232366Z scale_ub: Optional[float], 2025-05-07T20:32:54.5232455Z contiguous: bool, 2025-05-07T20:32:54.5232545Z compiled: bool, 2025-05-07T20:32:54.5232625Z ) -> None: 2025-05-07T20:32:54.5232724Z torch.manual_seed(2025) 2025-05-07T20:32:54.5232797Z 2025-05-07T20:32:54.5232967Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5233047Z 2025-05-07T20:32:54.5233139Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5233264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5233357Z x = x_sign * x_clamp 2025-05-07T20:32:54.5233483Z x0 = x[:, :D] 2025-05-07T20:32:54.5233569Z x1 = x[:, D:] 2025-05-07T20:32:54.5233646Z 2025-05-07T20:32:54.5233730Z if contiguous: 2025-05-07T20:32:54.5233824Z x0 = x0.contiguous() 2025-05-07T20:32:54.5233920Z x1 = x1.contiguous() 2025-05-07T20:32:54.5233992Z 2025-05-07T20:32:54.5234083Z if scale_ub is not None: 2025-05-07T20:32:54.5234194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5234330Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5234456Z ) 2025-05-07T20:32:54.5234532Z else: 2025-05-07T20:32:54.5234627Z scale_ub_tensor = None 2025-05-07T20:32:54.5234708Z 2025-05-07T20:32:54.5234841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5234933Z op = silu_mul_quant 2025-05-07T20:32:54.5235024Z if compiled: 2025-05-07T20:32:54.5235122Z op = torch.compile(op) 2025-05-07T20:32:54.5235230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5235310Z 2025-05-07T20:32:54.5235399Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5235403Z 2025-05-07T20:32:54.5235552Z moe/activation_test.py:117: 2025-05-07T20:32:54.5235684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5235785Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5235891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5236267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5236364Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5236877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5237014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5237392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5237621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5237972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5238077Z kernel = self.compile( 2025-05-07T20:32:54.5238470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5238649Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5238790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5238795Z 2025-05-07T20:32:54.5239033Z self = 2025-05-07T20:32:54.5239866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5240389Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d0720>} 2025-05-07T20:32:54.5241170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5241367Z context = 2025-05-07T20:32:54.5241372Z 2025-05-07T20:32:54.5241540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5241822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5241929Z module_map=module_map) 2025-05-07T20:32:54.5242138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5242247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5242324Z E ^ 2025-05-07T20:32:54.5242698Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5242703Z 2025-05-07T20:32:54.5243129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5243134Z 2025-05-07T20:32:54.5243237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5243515Z self=, 2025-05-07T20:32:54.5243591Z T=4096, 2025-05-07T20:32:54.5243677Z D=7168, 2025-05-07T20:32:54.5243765Z scale_ub=None, 2025-05-07T20:32:54.5243854Z contiguous=False, 2025-05-07T20:32:54.5243945Z compiled=True, 2025-05-07T20:32:54.5244019Z ) 2025-05-07T20:32:54.5244244Z self = 2025-05-07T20:32:54.5244431Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5244436Z 2025-05-07T20:32:54.5244514Z @given( 2025-05-07T20:32:54.5244679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5244787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5244903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5245027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5245140Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5245221Z ) 2025-05-07T20:32:54.5245480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5245574Z def test_silu_mul_quant( 2025-05-07T20:32:54.5245693Z self, 2025-05-07T20:32:54.5245776Z T: int, 2025-05-07T20:32:54.5245853Z D: int, 2025-05-07T20:32:54.5245952Z scale_ub: Optional[float], 2025-05-07T20:32:54.5246048Z contiguous: bool, 2025-05-07T20:32:54.5246138Z compiled: bool, 2025-05-07T20:32:54.5246216Z ) -> None: 2025-05-07T20:32:54.5246316Z torch.manual_seed(2025) 2025-05-07T20:32:54.5246391Z 2025-05-07T20:32:54.5246570Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5246646Z 2025-05-07T20:32:54.5246741Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5246870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5246958Z x = x_sign * x_clamp 2025-05-07T20:32:54.5247040Z x0 = x[:, :D] 2025-05-07T20:32:54.5247125Z x1 = x[:, D:] 2025-05-07T20:32:54.5247198Z 2025-05-07T20:32:54.5247280Z if contiguous: 2025-05-07T20:32:54.5247382Z x0 = x0.contiguous() 2025-05-07T20:32:54.5247474Z x1 = x1.contiguous() 2025-05-07T20:32:54.5247546Z 2025-05-07T20:32:54.5247642Z if scale_ub is not None: 2025-05-07T20:32:54.5247747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5247885Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5247965Z ) 2025-05-07T20:32:54.5248041Z else: 2025-05-07T20:32:54.5248145Z scale_ub_tensor = None 2025-05-07T20:32:54.5248215Z 2025-05-07T20:32:54.5248344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5248439Z op = silu_mul_quant 2025-05-07T20:32:54.5248525Z if compiled: 2025-05-07T20:32:54.5248624Z op = torch.compile(op) 2025-05-07T20:32:54.5248738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5248814Z 2025-05-07T20:32:54.5248912Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5248918Z 2025-05-07T20:32:54.5249038Z moe/activation_test.py:117: 2025-05-07T20:32:54.5249195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5249301Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5249402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5249828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5249931Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5250438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5250535Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5250907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5251173Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5251527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5251626Z kernel = self.compile( 2025-05-07T20:32:54.5252114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5252298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5252491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5252496Z 2025-05-07T20:32:54.5252715Z self = 2025-05-07T20:32:54.5253519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5254038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d1440>} 2025-05-07T20:32:54.5254856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5255054Z context = 2025-05-07T20:32:54.5255058Z 2025-05-07T20:32:54.5255236Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5255506Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5255614Z module_map=module_map) 2025-05-07T20:32:54.5255788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5255890Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5255966Z E ^ 2025-05-07T20:32:54.5256336Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5256344Z 2025-05-07T20:32:54.5256770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5256774Z 2025-05-07T20:32:54.5256885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5257131Z self=, 2025-05-07T20:32:54.5257211Z T=16384, 2025-05-07T20:32:54.5257296Z D=5120, 2025-05-07T20:32:54.5257380Z scale_ub=1200.0, 2025-05-07T20:32:54.5257474Z contiguous=False, 2025-05-07T20:32:54.5263081Z compiled=False, 2025-05-07T20:32:54.5263170Z ) 2025-05-07T20:32:54.5263421Z self = 2025-05-07T20:32:54.5263616Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5263621Z 2025-05-07T20:32:54.5263707Z @given( 2025-05-07T20:32:54.5263829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5263933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5264055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5264286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5264402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5264484Z ) 2025-05-07T20:32:54.5264743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5264838Z def test_silu_mul_quant( 2025-05-07T20:32:54.5264921Z self, 2025-05-07T20:32:54.5264999Z T: int, 2025-05-07T20:32:54.5265079Z D: int, 2025-05-07T20:32:54.5265183Z scale_ub: Optional[float], 2025-05-07T20:32:54.5265318Z contiguous: bool, 2025-05-07T20:32:54.5265410Z compiled: bool, 2025-05-07T20:32:54.5265489Z ) -> None: 2025-05-07T20:32:54.5265584Z torch.manual_seed(2025) 2025-05-07T20:32:54.5265666Z 2025-05-07T20:32:54.5265844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5265918Z 2025-05-07T20:32:54.5266017Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5266144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5266237Z x = x_sign * x_clamp 2025-05-07T20:32:54.5266323Z x0 = x[:, :D] 2025-05-07T20:32:54.5266403Z x1 = x[:, D:] 2025-05-07T20:32:54.5266519Z 2025-05-07T20:32:54.5266610Z if contiguous: 2025-05-07T20:32:54.5266702Z x0 = x0.contiguous() 2025-05-07T20:32:54.5266799Z x1 = x1.contiguous() 2025-05-07T20:32:54.5266871Z 2025-05-07T20:32:54.5266962Z if scale_ub is not None: 2025-05-07T20:32:54.5267075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5267215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5267293Z ) 2025-05-07T20:32:54.5267378Z else: 2025-05-07T20:32:54.5267471Z scale_ub_tensor = None 2025-05-07T20:32:54.5267586Z 2025-05-07T20:32:54.5267727Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5267818Z op = silu_mul_quant 2025-05-07T20:32:54.5267902Z if compiled: 2025-05-07T20:32:54.5268019Z op = torch.compile(op) 2025-05-07T20:32:54.5268124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5268205Z 2025-05-07T20:32:54.5268299Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5268304Z 2025-05-07T20:32:54.5268402Z moe/activation_test.py:117: 2025-05-07T20:32:54.5268543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5268651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5268762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5269328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.5269431Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5269809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5270039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5270392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5270497Z kernel = self.compile( 2025-05-07T20:32:54.5270890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5271071Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5271209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5271215Z 2025-05-07T20:32:54.5271425Z self = 2025-05-07T20:32:54.5272239Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5272809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d2340>} 2025-05-07T20:32:54.5273591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5273788Z context = 2025-05-07T20:32:54.5273831Z 2025-05-07T20:32:54.5274000Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5274277Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5274388Z module_map=module_map) 2025-05-07T20:32:54.5274553Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5274656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5274732Z E ^ 2025-05-07T20:32:54.5275107Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5275112Z 2025-05-07T20:32:54.5275577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5275582Z 2025-05-07T20:32:54.5275687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5275922Z self=, 2025-05-07T20:32:54.5276001Z T=16384, 2025-05-07T20:32:54.5276082Z D=5120, 2025-05-07T20:32:54.5276166Z scale_ub=1200.0, 2025-05-07T20:32:54.5276252Z contiguous=True, 2025-05-07T20:32:54.5276340Z compiled=True, 2025-05-07T20:32:54.5276457Z ) 2025-05-07T20:32:54.5276682Z self = 2025-05-07T20:32:54.5276869Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5276877Z 2025-05-07T20:32:54.5276954Z @given( 2025-05-07T20:32:54.5277075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5277182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5277297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5277421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5277535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5277609Z ) 2025-05-07T20:32:54.5277867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5277962Z def test_silu_mul_quant( 2025-05-07T20:32:54.5278039Z self, 2025-05-07T20:32:54.5278122Z T: int, 2025-05-07T20:32:54.5278202Z D: int, 2025-05-07T20:32:54.5278300Z scale_ub: Optional[float], 2025-05-07T20:32:54.5278396Z contiguous: bool, 2025-05-07T20:32:54.5278481Z compiled: bool, 2025-05-07T20:32:54.5278559Z ) -> None: 2025-05-07T20:32:54.5278685Z torch.manual_seed(2025) 2025-05-07T20:32:54.5278762Z 2025-05-07T20:32:54.5278965Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5279042Z 2025-05-07T20:32:54.5279135Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5279266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5279354Z x = x_sign * x_clamp 2025-05-07T20:32:54.5279433Z x0 = x[:, :D] 2025-05-07T20:32:54.5279520Z x1 = x[:, D:] 2025-05-07T20:32:54.5279593Z 2025-05-07T20:32:54.5279678Z if contiguous: 2025-05-07T20:32:54.5279775Z x0 = x0.contiguous() 2025-05-07T20:32:54.5279864Z x1 = x1.contiguous() 2025-05-07T20:32:54.5279937Z 2025-05-07T20:32:54.5280037Z if scale_ub is not None: 2025-05-07T20:32:54.5280142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5280280Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5280361Z ) 2025-05-07T20:32:54.5280489Z else: 2025-05-07T20:32:54.5280591Z scale_ub_tensor = None 2025-05-07T20:32:54.5280666Z 2025-05-07T20:32:54.5280797Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5280892Z op = silu_mul_quant 2025-05-07T20:32:54.5280978Z if compiled: 2025-05-07T20:32:54.5281078Z op = torch.compile(op) 2025-05-07T20:32:54.5281191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5281265Z 2025-05-07T20:32:54.5281398Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5281403Z 2025-05-07T20:32:54.5281507Z moe/activation_test.py:117: 2025-05-07T20:32:54.5281644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5281753Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5281853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5282234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5282333Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5282882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5282981Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5283355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5283582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5283943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5284037Z kernel = self.compile( 2025-05-07T20:32:54.5284471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5284662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5284796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5284801Z 2025-05-07T20:32:54.5285019Z self = 2025-05-07T20:32:54.5285821Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5286347Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0c7d39c0>} 2025-05-07T20:32:54.5287129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5287329Z context = 2025-05-07T20:32:54.5287334Z 2025-05-07T20:32:54.5287509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5287783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5287893Z module_map=module_map) 2025-05-07T20:32:54.5288063Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5288164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5288243Z E ^ 2025-05-07T20:32:54.5288656Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5288662Z 2025-05-07T20:32:54.5289103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5289110Z 2025-05-07T20:32:54.5289219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5289496Z self=, 2025-05-07T20:32:54.5289577Z T=16384, 2025-05-07T20:32:54.5289661Z D=5120, 2025-05-07T20:32:54.5289745Z scale_ub=None, 2025-05-07T20:32:54.5289841Z contiguous=False, 2025-05-07T20:32:54.5289927Z compiled=True, 2025-05-07T20:32:54.5289999Z ) 2025-05-07T20:32:54.5290232Z self = 2025-05-07T20:32:54.5290414Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5290460Z 2025-05-07T20:32:54.5290536Z @given( 2025-05-07T20:32:54.5290664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5290764Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5290884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5291008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5291121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5291201Z ) 2025-05-07T20:32:54.5291456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5291547Z def test_silu_mul_quant( 2025-05-07T20:32:54.5291670Z self, 2025-05-07T20:32:54.5291749Z T: int, 2025-05-07T20:32:54.5291950Z D: int, 2025-05-07T20:32:54.5292054Z scale_ub: Optional[float], 2025-05-07T20:32:54.5292143Z contiguous: bool, 2025-05-07T20:32:54.5292227Z compiled: bool, 2025-05-07T20:32:54.5292316Z ) -> None: 2025-05-07T20:32:54.5292412Z torch.manual_seed(2025) 2025-05-07T20:32:54.5292485Z 2025-05-07T20:32:54.5292660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5292732Z 2025-05-07T20:32:54.5292916Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5293043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5293131Z x = x_sign * x_clamp 2025-05-07T20:32:54.5293216Z x0 = x[:, :D] 2025-05-07T20:32:54.5293299Z x1 = x[:, D:] 2025-05-07T20:32:54.5293371Z 2025-05-07T20:32:54.5293462Z if contiguous: 2025-05-07T20:32:54.5293555Z x0 = x0.contiguous() 2025-05-07T20:32:54.5293647Z x1 = x1.contiguous() 2025-05-07T20:32:54.5293727Z 2025-05-07T20:32:54.5293816Z if scale_ub is not None: 2025-05-07T20:32:54.5293920Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5294064Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5294142Z ) 2025-05-07T20:32:54.5294218Z else: 2025-05-07T20:32:54.5294318Z scale_ub_tensor = None 2025-05-07T20:32:54.5294390Z 2025-05-07T20:32:54.5294524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5294618Z op = silu_mul_quant 2025-05-07T20:32:54.5294705Z if compiled: 2025-05-07T20:32:54.5294810Z op = torch.compile(op) 2025-05-07T20:32:54.5294914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5294987Z 2025-05-07T20:32:54.5295087Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5295091Z 2025-05-07T20:32:54.5295187Z moe/activation_test.py:117: 2025-05-07T20:32:54.5295323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5295430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5295531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5295916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5296010Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5296521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5296626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5296994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5297274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5297639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5297735Z kernel = self.compile( 2025-05-07T20:32:54.5298139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5298318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5298518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5298523Z 2025-05-07T20:32:54.5298740Z self = 2025-05-07T20:32:54.5299553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5300120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce40c20>} 2025-05-07T20:32:54.5300893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5301096Z context = 2025-05-07T20:32:54.5301103Z 2025-05-07T20:32:54.5301273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5301544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5301700Z module_map=module_map) 2025-05-07T20:32:54.5301865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5301969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5302056Z E ^ 2025-05-07T20:32:54.5302423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5302427Z 2025-05-07T20:32:54.5302860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5302864Z 2025-05-07T20:32:54.5302967Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5303201Z self=, 2025-05-07T20:32:54.5303285Z T=2048, 2025-05-07T20:32:54.5303361Z D=5120, 2025-05-07T20:32:54.5303448Z scale_ub=None, 2025-05-07T20:32:54.5303542Z contiguous=False, 2025-05-07T20:32:54.5303629Z compiled=True, 2025-05-07T20:32:54.5303702Z ) 2025-05-07T20:32:54.5303932Z self = 2025-05-07T20:32:54.5304114Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5304118Z 2025-05-07T20:32:54.5304204Z @given( 2025-05-07T20:32:54.5304327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5304427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5304552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5304670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5304784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5304868Z ) 2025-05-07T20:32:54.5305123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5305219Z def test_silu_mul_quant( 2025-05-07T20:32:54.5305295Z self, 2025-05-07T20:32:54.5305374Z T: int, 2025-05-07T20:32:54.5305456Z D: int, 2025-05-07T20:32:54.5305554Z scale_ub: Optional[float], 2025-05-07T20:32:54.5305642Z contiguous: bool, 2025-05-07T20:32:54.5305732Z compiled: bool, 2025-05-07T20:32:54.5305856Z ) -> None: 2025-05-07T20:32:54.5305952Z torch.manual_seed(2025) 2025-05-07T20:32:54.5306032Z 2025-05-07T20:32:54.5306642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5306751Z 2025-05-07T20:32:54.5306887Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5307019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5307115Z x = x_sign * x_clamp 2025-05-07T20:32:54.5307197Z x0 = x[:, :D] 2025-05-07T20:32:54.5307468Z x1 = x[:, D:] 2025-05-07T20:32:54.5307548Z 2025-05-07T20:32:54.5307631Z if contiguous: 2025-05-07T20:32:54.5307722Z x0 = x0.contiguous() 2025-05-07T20:32:54.5307816Z x1 = x1.contiguous() 2025-05-07T20:32:54.5307889Z 2025-05-07T20:32:54.5307978Z if scale_ub is not None: 2025-05-07T20:32:54.5308088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5308228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5308305Z ) 2025-05-07T20:32:54.5308407Z else: 2025-05-07T20:32:54.5308509Z scale_ub_tensor = None 2025-05-07T20:32:54.5308671Z 2025-05-07T20:32:54.5308810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5308899Z op = silu_mul_quant 2025-05-07T20:32:54.5308991Z if compiled: 2025-05-07T20:32:54.5309091Z op = torch.compile(op) 2025-05-07T20:32:54.5309196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5309277Z 2025-05-07T20:32:54.5309367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5309372Z 2025-05-07T20:32:54.5309469Z moe/activation_test.py:117: 2025-05-07T20:32:54.5309605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5309776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5309877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5310262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5310355Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5310872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5310970Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5311337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5311574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5311923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5312027Z kernel = self.compile( 2025-05-07T20:32:54.5312421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5312601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5312739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5312743Z 2025-05-07T20:32:54.5312960Z self = 2025-05-07T20:32:54.5313766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5314297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce419e0>} 2025-05-07T20:32:54.5315074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5315349Z context = 2025-05-07T20:32:54.5315354Z 2025-05-07T20:32:54.5315530Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5315808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5315916Z module_map=module_map) 2025-05-07T20:32:54.5316082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5316230Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5316308Z E ^ 2025-05-07T20:32:54.5316676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5316692Z 2025-05-07T20:32:54.5317119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5317124Z 2025-05-07T20:32:54.5317230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5317469Z self=, 2025-05-07T20:32:54.5317547Z T=2048, 2025-05-07T20:32:54.5317663Z D=5120, 2025-05-07T20:32:54.5317755Z scale_ub=1200.0, 2025-05-07T20:32:54.5317842Z contiguous=False, 2025-05-07T20:32:54.5317926Z compiled=True, 2025-05-07T20:32:54.5318006Z ) 2025-05-07T20:32:54.5318232Z self = 2025-05-07T20:32:54.5318422Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5318426Z 2025-05-07T20:32:54.5318504Z @given( 2025-05-07T20:32:54.5318623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5318792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5318907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5319041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5319181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5319272Z ) 2025-05-07T20:32:54.5319527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5319634Z def test_silu_mul_quant( 2025-05-07T20:32:54.5319710Z self, 2025-05-07T20:32:54.5319794Z T: int, 2025-05-07T20:32:54.5319870Z D: int, 2025-05-07T20:32:54.5319972Z scale_ub: Optional[float], 2025-05-07T20:32:54.5320068Z contiguous: bool, 2025-05-07T20:32:54.5320156Z compiled: bool, 2025-05-07T20:32:54.5320233Z ) -> None: 2025-05-07T20:32:54.5320335Z torch.manual_seed(2025) 2025-05-07T20:32:54.5320410Z 2025-05-07T20:32:54.5320580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5320666Z 2025-05-07T20:32:54.5320757Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5320885Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5320976Z x = x_sign * x_clamp 2025-05-07T20:32:54.5321061Z x0 = x[:, :D] 2025-05-07T20:32:54.5321146Z x1 = x[:, D:] 2025-05-07T20:32:54.5321218Z 2025-05-07T20:32:54.5321300Z if contiguous: 2025-05-07T20:32:54.5321400Z x0 = x0.contiguous() 2025-05-07T20:32:54.5321489Z x1 = x1.contiguous() 2025-05-07T20:32:54.5321561Z 2025-05-07T20:32:54.5321656Z if scale_ub is not None: 2025-05-07T20:32:54.5321760Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5321896Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5321982Z ) 2025-05-07T20:32:54.5322059Z else: 2025-05-07T20:32:54.5322153Z scale_ub_tensor = None 2025-05-07T20:32:54.5322230Z 2025-05-07T20:32:54.5322364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5322473Z op = silu_mul_quant 2025-05-07T20:32:54.5322567Z if compiled: 2025-05-07T20:32:54.5322668Z op = torch.compile(op) 2025-05-07T20:32:54.5322824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5322905Z 2025-05-07T20:32:54.5322997Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5323004Z 2025-05-07T20:32:54.5323102Z moe/activation_test.py:117: 2025-05-07T20:32:54.5323239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5323342Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5323452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5323871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5323964Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5324480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5324581Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5324952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5325187Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5325608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5325711Z kernel = self.compile( 2025-05-07T20:32:54.5326107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5326288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5326425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5326429Z 2025-05-07T20:32:54.5326680Z self = 2025-05-07T20:32:54.5327493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5328016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fdb0ce42b60>} 2025-05-07T20:32:54.5328791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5328994Z context = 2025-05-07T20:32:54.5328999Z 2025-05-07T20:32:54.5329168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5329450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5329560Z module_map=module_map) 2025-05-07T20:32:54.5329727Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5329833Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5329912Z E ^ 2025-05-07T20:32:54.5330286Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5330290Z 2025-05-07T20:32:54.5330719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5330724Z 2025-05-07T20:32:54.5330831Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5331068Z self=, 2025-05-07T20:32:54.5331144Z T=4096, 2025-05-07T20:32:54.5331219Z D=5120, 2025-05-07T20:32:54.5331310Z scale_ub=1200.0, 2025-05-07T20:32:54.5331396Z contiguous=True, 2025-05-07T20:32:54.5331485Z compiled=True, 2025-05-07T20:32:54.5331557Z ) 2025-05-07T20:32:54.5331906Z self = 2025-05-07T20:32:54.5332096Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5332100Z 2025-05-07T20:32:54.5332177Z @given( 2025-05-07T20:32:54.5332300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5332410Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5332525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5332644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5333393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5333466Z ) 2025-05-07T20:32:54.5333727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5333820Z def test_silu_mul_quant( 2025-05-07T20:32:54.5333898Z self, 2025-05-07T20:32:54.5333980Z T: int, 2025-05-07T20:32:54.5334055Z D: int, 2025-05-07T20:32:54.5334152Z scale_ub: Optional[float], 2025-05-07T20:32:54.5334249Z contiguous: bool, 2025-05-07T20:32:54.5334336Z compiled: bool, 2025-05-07T20:32:54.5334414Z ) -> None: 2025-05-07T20:32:54.5334514Z torch.manual_seed(2025) 2025-05-07T20:32:54.5334630Z 2025-05-07T20:32:54.5334803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5334888Z 2025-05-07T20:32:54.5334979Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5335110Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5335199Z x = x_sign * x_clamp 2025-05-07T20:32:54.5335283Z x0 = x[:, :D] 2025-05-07T20:32:54.5335368Z x1 = x[:, D:] 2025-05-07T20:32:54.5335441Z 2025-05-07T20:32:54.5335523Z if contiguous: 2025-05-07T20:32:54.5335623Z x0 = x0.contiguous() 2025-05-07T20:32:54.5335844Z x1 = x1.contiguous() 2025-05-07T20:32:54.5335916Z 2025-05-07T20:32:54.5336014Z if scale_ub is not None: 2025-05-07T20:32:54.5336120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5336261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5336348Z ) 2025-05-07T20:32:54.5336424Z else: 2025-05-07T20:32:54.5336520Z scale_ub_tensor = None 2025-05-07T20:32:54.5336600Z 2025-05-07T20:32:54.5336730Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5336828Z op = silu_mul_quant 2025-05-07T20:32:54.5336913Z if compiled: 2025-05-07T20:32:54.5337013Z op = torch.compile(op) 2025-05-07T20:32:54.5337127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5337200Z 2025-05-07T20:32:54.5337294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5337298Z 2025-05-07T20:32:54.5337404Z moe/activation_test.py:117: 2025-05-07T20:32:54.5337536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5337638Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5337746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5338124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5338227Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5338737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5338835Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5339212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5339444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5339799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5339896Z kernel = self.compile( 2025-05-07T20:32:54.5340336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5340526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5340660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5340664Z 2025-05-07T20:32:54.5340873Z self = 2025-05-07T20:32:54.5341687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5342253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871f58220>} 2025-05-07T20:32:54.5343047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5343243Z context = 2025-05-07T20:32:54.5343248Z 2025-05-07T20:32:54.5343463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5343735Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5343845Z module_map=module_map) 2025-05-07T20:32:54.5344019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5344117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5344194Z E ^ 2025-05-07T20:32:54.5344565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5344610Z 2025-05-07T20:32:54.5345040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5345049Z 2025-05-07T20:32:54.5345160Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5345393Z self=, 2025-05-07T20:32:54.5345470Z T=128, 2025-05-07T20:32:54.5345554Z D=5120, 2025-05-07T20:32:54.5345638Z scale_ub=1200.0, 2025-05-07T20:32:54.5345724Z contiguous=False, 2025-05-07T20:32:54.5345817Z compiled=True, 2025-05-07T20:32:54.5345889Z ) 2025-05-07T20:32:54.5346119Z self = 2025-05-07T20:32:54.5346299Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.5346303Z 2025-05-07T20:32:54.5346381Z @given( 2025-05-07T20:32:54.5346509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5346610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5346724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5346851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5346964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5347039Z ) 2025-05-07T20:32:54.5347299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5347392Z def test_silu_mul_quant( 2025-05-07T20:32:54.5347474Z self, 2025-05-07T20:32:54.5347549Z T: int, 2025-05-07T20:32:54.5347626Z D: int, 2025-05-07T20:32:54.5347729Z scale_ub: Optional[float], 2025-05-07T20:32:54.5347819Z contiguous: bool, 2025-05-07T20:32:54.5347904Z compiled: bool, 2025-05-07T20:32:54.5347990Z ) -> None: 2025-05-07T20:32:54.5348084Z torch.manual_seed(2025) 2025-05-07T20:32:54.5348155Z 2025-05-07T20:32:54.5348335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5348408Z 2025-05-07T20:32:54.5348499Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5348676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5348764Z x = x_sign * x_clamp 2025-05-07T20:32:54.5348848Z x0 = x[:, :D] 2025-05-07T20:32:54.5348927Z x1 = x[:, D:] 2025-05-07T20:32:54.5349002Z 2025-05-07T20:32:54.5349092Z if contiguous: 2025-05-07T20:32:54.5349182Z x0 = x0.contiguous() 2025-05-07T20:32:54.5349271Z x1 = x1.contiguous() 2025-05-07T20:32:54.5349350Z 2025-05-07T20:32:54.5349442Z if scale_ub is not None: 2025-05-07T20:32:54.5349549Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5349734Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5349811Z ) 2025-05-07T20:32:54.5349890Z else: 2025-05-07T20:32:54.5349991Z scale_ub_tensor = None 2025-05-07T20:32:54.5350068Z 2025-05-07T20:32:54.5350200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5350296Z op = silu_mul_quant 2025-05-07T20:32:54.5350381Z if compiled: 2025-05-07T20:32:54.5350486Z op = torch.compile(op) 2025-05-07T20:32:54.5350591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5350664Z 2025-05-07T20:32:54.5350802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5350807Z 2025-05-07T20:32:54.5350906Z moe/activation_test.py:117: 2025-05-07T20:32:54.5351038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5351143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5351247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5351630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5351722Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5352271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5352374Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5352743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5352975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5353331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5353425Z kernel = self.compile( 2025-05-07T20:32:54.5353824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5354005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5354134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5354142Z 2025-05-07T20:32:54.5354356Z self = 2025-05-07T20:32:54.5355165Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5355695Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871f58ea0>} 2025-05-07T20:32:54.5356468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5356664Z context = 2025-05-07T20:32:54.5356675Z 2025-05-07T20:32:54.5356851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5357123Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5357281Z module_map=module_map) 2025-05-07T20:32:54.5357448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5357548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5357636Z E ^ 2025-05-07T20:32:54.5358002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5358007Z 2025-05-07T20:32:54.5358444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5358519Z 2025-05-07T20:32:54.5358624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5358852Z self=, 2025-05-07T20:32:54.5358940Z T=16384, 2025-05-07T20:32:54.5359018Z D=7168, 2025-05-07T20:32:54.5359112Z scale_ub=1200.0, 2025-05-07T20:32:54.5359219Z contiguous=True, 2025-05-07T20:32:54.5359318Z compiled=True, 2025-05-07T20:32:54.5359403Z ) 2025-05-07T20:32:54.5359637Z self = 2025-05-07T20:32:54.5359818Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5359867Z 2025-05-07T20:32:54.5359954Z @given( 2025-05-07T20:32:54.5360075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5360175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5360296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5360417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5360530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5360611Z ) 2025-05-07T20:32:54.5360863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5361002Z def test_silu_mul_quant( 2025-05-07T20:32:54.5361085Z self, 2025-05-07T20:32:54.5361163Z T: int, 2025-05-07T20:32:54.5361247Z D: int, 2025-05-07T20:32:54.5361349Z scale_ub: Optional[float], 2025-05-07T20:32:54.5361439Z contiguous: bool, 2025-05-07T20:32:54.5361531Z compiled: bool, 2025-05-07T20:32:54.5361609Z ) -> None: 2025-05-07T20:32:54.5361706Z torch.manual_seed(2025) 2025-05-07T20:32:54.5361783Z 2025-05-07T20:32:54.5361951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5362024Z 2025-05-07T20:32:54.5362121Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5362245Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5362336Z x = x_sign * x_clamp 2025-05-07T20:32:54.5362422Z x0 = x[:, :D] 2025-05-07T20:32:54.5362500Z x1 = x[:, D:] 2025-05-07T20:32:54.5362577Z 2025-05-07T20:32:54.5362663Z if contiguous: 2025-05-07T20:32:54.5362754Z x0 = x0.contiguous() 2025-05-07T20:32:54.5362851Z x1 = x1.contiguous() 2025-05-07T20:32:54.5362924Z 2025-05-07T20:32:54.5363015Z if scale_ub is not None: 2025-05-07T20:32:54.5363131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5363267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5363346Z ) 2025-05-07T20:32:54.5363429Z else: 2025-05-07T20:32:54.5363522Z scale_ub_tensor = None 2025-05-07T20:32:54.5363594Z 2025-05-07T20:32:54.5363733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5363822Z op = silu_mul_quant 2025-05-07T20:32:54.5363908Z if compiled: 2025-05-07T20:32:54.5364013Z op = torch.compile(op) 2025-05-07T20:32:54.5364118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5364196Z 2025-05-07T20:32:54.5364287Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5364294Z 2025-05-07T20:32:54.5364388Z moe/activation_test.py:117: 2025-05-07T20:32:54.5364527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5364677Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5364778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5365164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5365259Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5365775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.5365871Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.5366279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.5366512Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.5366867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.5366961Z     kernel = self.compile(
2025-05-07T20:32:54.5367366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.5367545Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5367722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5367937Z self = <triton.compiler.compiler.ASTSource object>
2025-05-07T20:32:54.5368740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.5369318Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd871f5a0c0>}
2025-05-07T20:32:54.5370138Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:54.5370342Z context = <...>
2025-05-07T20:32:54.5370515Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.5370791Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.5370897Z                            module_map=module_map)
2025-05-07T20:32:54.5371064Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5371170Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.5371247Z E       ^
2025-05-07T20:32:54.5371616Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.5372166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:54.5372274Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.5372515Z     self=<...>,
2025-05-07T20:32:54.5372593Z     T=16384,
2025-05-07T20:32:54.5372670Z     D=5120,
2025-05-07T20:32:54.5372760Z     scale_ub=1200.0,
2025-05-07T20:32:54.5372845Z     contiguous=True,
2025-05-07T20:32:54.5372933Z     compiled=False,
2025-05-07T20:32:54.5373018Z )
2025-05-07T20:32:54.5373424Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:54.5373515Z     @given(
2025-05-07T20:32:54.5373636Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.5373742Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.5373908Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.5374030Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.5374153Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.5374230Z     )
2025-05-07T20:32:54.5374484Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.5374587Z     def test_silu_mul_quant(
2025-05-07T20:32:54.5374664Z         self,
2025-05-07T20:32:54.5374741Z         T: int,
2025-05-07T20:32:54.5374826Z         D: int,
2025-05-07T20:32:54.5374967Z         scale_ub: Optional[float],
2025-05-07T20:32:54.5375064Z         contiguous: bool,
2025-05-07T20:32:54.5375149Z         compiled: bool,
2025-05-07T20:32:54.5375227Z     ) -> None:
2025-05-07T20:32:54.5375332Z         torch.manual_seed(2025)
2025-05-07T20:32:54.5375578Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.5375752Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.5375880Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.5375976Z         x = x_sign * x_clamp
2025-05-07T20:32:54.5376099Z         x0 = x[:, :D]
2025-05-07T20:32:54.5376181Z         x1 = x[:, D:]
2025-05-07T20:32:54.5376343Z         if contiguous:
2025-05-07T20:32:54.5376433Z             x0 = x0.contiguous()
2025-05-07T20:32:54.5376532Z             x1 = x1.contiguous()
2025-05-07T20:32:54.5376698Z         if scale_ub is not None:
2025-05-07T20:32:54.5376803Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.5376939Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.5377021Z             )
2025-05-07T20:32:54.5377140Z         else:
2025-05-07T20:32:54.5377235Z             scale_ub_tensor = None
2025-05-07T20:32:54.5377447Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.5377540Z             op = silu_mul_quant
2025-05-07T20:32:54.5377631Z             if compiled:
2025-05-07T20:32:54.5377730Z                 op = torch.compile(op)
2025-05-07T20:32:54.5377837Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.5378008Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.5378124Z moe/activation_test.py:117:
2025-05-07T20:32:54.5378275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5378401Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.5378508Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.5379022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.5379121Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.5379497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.5379726Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.5380084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.5380179Z     kernel = self.compile(
2025-05-07T20:32:54.5380572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.5380756Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5384229Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5384329Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.5384413Z E       ^
2025-05-07T20:32:54.5384818Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.5385250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
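Every example that reaches the kernel launch dies at the same point: Triton refuses to emit fp8e4nv (its name for float8_e4m3fn) on this runner's GPU, offering only ('fp8e4b15', 'fp8e5'). The g5.4xlarge carries an A10G, which reports compute capability (8, 6). A minimal guard along these lines, assuming a capability floor of (8, 9) and a hypothetical helper name (this is not what the suite currently does), would skip these examples instead of erroring:

```python
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (float8_e4m3fn) kernels need compute capability
    # >= (8, 9) (Ada/Hopper). The A10G on this runner reports (8, 6),
    # which matches the CompilationError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test:
# @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
# def test_silu_mul_quant(self, ...) -> None: ...
```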
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError (reached through torch/_dynamo/eval_frame.py:678, since compiled=True routes the call through torch.compile)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError
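For orientation while reading these failures: judging from the test body alone, silu_mul_quant fuses y = silu(x0) * x1 with a rowwise FP8 quantization and returns the quantized tensor together with per-row scales, which is exactly why it needs fp8e4nv. The eager sketch below is an assumption inferred from the test, not FBGEMM's implementation; the 448.0 e4m3 max, the scale layout, and the name silu_mul_quant_ref are all guesses:

```python
from typing import Optional, Tuple

import torch
import torch.nn.functional as F

FP8_E4M3_MAX = 448.0  # assumed representable max for torch.float8_e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Presumed semantics: y = silu(x0) * x1, quantized rowwise to fp8e4nv
    # (torch.float8_e4m3fn), returning the fp8 tensor and per-row scales.
    y = F.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        # Cap the per-row max, mirroring the test's scale_ub_tensor argument.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)
```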
2025-05-07T20:32:54.5445433Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.5445751Z     T=16384,
2025-05-07T20:32:54.5445829Z     D=5120,
2025-05-07T20:32:54.5445922Z     scale_ub=None,
2025-05-07T20:32:54.5446008Z     contiguous=False,
2025-05-07T20:32:54.5446094Z     compiled=False,
2025-05-07T20:32:54.5446175Z )
2025-05-07T20:32:54.5446589Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:54.5448732Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.5448951Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.5449100Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.5451038Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.5451215Z moe/activation_test.py:95: OutOfMemoryError
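From here on, most examples never reach the kernel: at T=16384 a single [T, 2*D] bf16 input already exceeds 0.6 GiB, and tensors cached from earlier examples are still resident, so the 22 GiB A10G runs dry during test setup. The allocator's own hint is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; pairing that with an explicit cache flush between Hypothesis examples is a plausible mitigation. A sketch, where the tearDown placement is an assumption rather than something this suite currently does:

```python
import os

# Allocator hint taken from the error message; it must be set before the
# first CUDA allocation to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch


def release_cuda_memory() -> None:
    # Drop unreferenced tensors, then return cached blocks to the driver,
    # so each generated example starts from a clean allocator state.
    gc.collect()
    torch.cuda.empty_cache()


# Hypothetical placement in the test class:
# def tearDown(self) -> None:
#     release_cuda_memory()
```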
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5451085Z 2025-05-07T20:32:54.5451215Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5451220Z 2025-05-07T20:32:54.5451323Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5451556Z self=, 2025-05-07T20:32:54.5451640Z T=4096, 2025-05-07T20:32:54.5451717Z D=7168, 2025-05-07T20:32:54.5451800Z scale_ub=1200.0, 2025-05-07T20:32:54.5451984Z contiguous=True, 2025-05-07T20:32:54.5452067Z compiled=True, 2025-05-07T20:32:54.5452144Z ) 2025-05-07T20:32:54.5452380Z self = 2025-05-07T20:32:54.5452558Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5452563Z 2025-05-07T20:32:54.5452648Z @given( 2025-05-07T20:32:54.5452768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5452866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5452988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5453105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5453218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5453301Z ) 2025-05-07T20:32:54.5453554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5453648Z def test_silu_mul_quant( 2025-05-07T20:32:54.5453732Z self, 2025-05-07T20:32:54.5453808Z T: int, 2025-05-07T20:32:54.5453885Z D: int, 2025-05-07T20:32:54.5453990Z scale_ub: Optional[float], 2025-05-07T20:32:54.5454077Z contiguous: bool, 2025-05-07T20:32:54.5454166Z compiled: bool, 2025-05-07T20:32:54.5454263Z ) -> None: 2025-05-07T20:32:54.5454357Z torch.manual_seed(2025) 2025-05-07T20:32:54.5454435Z 2025-05-07T20:32:54.5454613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5454687Z 2025-05-07T20:32:54.5454780Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5454961Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5456846Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5456900Z 2025-05-07T20:32:54.5457024Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5457031Z 2025-05-07T20:32:54.5457134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5457369Z self=, 2025-05-07T20:32:54.5457449Z T=16384, 2025-05-07T20:32:54.5457524Z D=7168, 2025-05-07T20:32:54.5457614Z scale_ub=None, 2025-05-07T20:32:54.5457700Z contiguous=False, 2025-05-07T20:32:54.5457856Z compiled=False, 2025-05-07T20:32:54.5457938Z ) 2025-05-07T20:32:54.5458162Z self = 2025-05-07T20:32:54.5458352Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5458356Z 2025-05-07T20:32:54.5458433Z @given( 2025-05-07T20:32:54.5458568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5458685Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5458825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5458989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5459109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5459183Z ) 2025-05-07T20:32:54.5459442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5459545Z def test_silu_mul_quant( 2025-05-07T20:32:54.5459624Z self, 2025-05-07T20:32:54.5459710Z T: int, 2025-05-07T20:32:54.5459792Z D: int, 2025-05-07T20:32:54.5459891Z scale_ub: Optional[float], 2025-05-07T20:32:54.5459985Z contiguous: bool, 2025-05-07T20:32:54.5460071Z compiled: bool, 2025-05-07T20:32:54.5460150Z ) -> None: 2025-05-07T20:32:54.5460251Z torch.manual_seed(2025) 2025-05-07T20:32:54.5460328Z 2025-05-07T20:32:54.5460495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5462382Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5462391Z 2025-05-07T20:32:54.5462509Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5462513Z 2025-05-07T20:32:54.5462621Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5462852Z self=, 2025-05-07T20:32:54.5462939Z T=2048, 2025-05-07T20:32:54.5463015Z D=7168, 2025-05-07T20:32:54.5463098Z scale_ub=1200.0, 2025-05-07T20:32:54.5463189Z contiguous=True, 2025-05-07T20:32:54.5463271Z compiled=True, 2025-05-07T20:32:54.5463346Z ) 2025-05-07T20:32:54.5463577Z self = 2025-05-07T20:32:54.5463752Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5463757Z 2025-05-07T20:32:54.5463882Z @given( 2025-05-07T20:32:54.5464015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5464115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5464235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5464352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5464466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5464546Z ) 2025-05-07T20:32:54.5464799Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5464936Z def test_silu_mul_quant( 2025-05-07T20:32:54.5465017Z self, 2025-05-07T20:32:54.5465096Z T: int, 2025-05-07T20:32:54.5465172Z D: int, 2025-05-07T20:32:54.5465281Z scale_ub: Optional[float], 2025-05-07T20:32:54.5465373Z contiguous: bool, 2025-05-07T20:32:54.5465458Z compiled: bool, 2025-05-07T20:32:54.5465542Z ) -> None: 2025-05-07T20:32:54.5465641Z torch.manual_seed(2025) 2025-05-07T20:32:54.5465720Z 2025-05-07T20:32:54.5465889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5466010Z 2025-05-07T20:32:54.5466108Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5466236Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5468090Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5468142Z 2025-05-07T20:32:54.5468279Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5468285Z 2025-05-07T20:32:54.5468397Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5468654Z self=, 2025-05-07T20:32:54.5468734Z T=2048, 2025-05-07T20:32:54.5468810Z D=7168, 2025-05-07T20:32:54.5468901Z scale_ub=None, 2025-05-07T20:32:54.5468988Z contiguous=True, 2025-05-07T20:32:54.5469080Z compiled=False, 2025-05-07T20:32:54.5469156Z ) 2025-05-07T20:32:54.5469384Z self = 2025-05-07T20:32:54.5469566Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5469570Z 2025-05-07T20:32:54.5469649Z @given( 2025-05-07T20:32:54.5469772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5469881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5469995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5470115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5470238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5470314Z ) 2025-05-07T20:32:54.5470578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5470675Z def test_silu_mul_quant( 2025-05-07T20:32:54.5470752Z self, 2025-05-07T20:32:54.5470837Z T: int, 2025-05-07T20:32:54.5470917Z D: int, 2025-05-07T20:32:54.5471015Z scale_ub: Optional[float], 2025-05-07T20:32:54.5471113Z contiguous: bool, 2025-05-07T20:32:54.5471200Z compiled: bool, 2025-05-07T20:32:54.5471278Z ) -> None: 2025-05-07T20:32:54.5471381Z torch.manual_seed(2025) 2025-05-07T20:32:54.5471458Z 2025-05-07T20:32:54.5471627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5471706Z 2025-05-07T20:32:54.5471799Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.5473713Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5473756Z 2025-05-07T20:32:54.5473875Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.5473880Z 2025-05-07T20:32:54.5473988Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5474222Z self=, 2025-05-07T20:32:54.5474299Z T=1, 2025-05-07T20:32:54.5474382Z D=7168, 2025-05-07T20:32:54.5474468Z scale_ub=1200.0, 2025-05-07T20:32:54.5474557Z contiguous=True, 2025-05-07T20:32:54.5474651Z compiled=False, 2025-05-07T20:32:54.5474724Z ) 2025-05-07T20:32:54.5475090Z self = 2025-05-07T20:32:54.5475267Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5475271Z 2025-05-07T20:32:54.5475348Z @given( 2025-05-07T20:32:54.5475475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5475573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5475688Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5475811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5475926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5476044Z ) 2025-05-07T20:32:54.5476303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5476399Z def test_silu_mul_quant( 2025-05-07T20:32:54.5476476Z self, 2025-05-07T20:32:54.5476563Z T: int, 2025-05-07T20:32:54.5476641Z D: int, 2025-05-07T20:32:54.5476740Z scale_ub: Optional[float], 2025-05-07T20:32:54.5476837Z contiguous: bool, 2025-05-07T20:32:54.5476923Z compiled: bool, 2025-05-07T20:32:54.5477005Z ) -> None: 2025-05-07T20:32:54.5477099Z torch.manual_seed(2025) 2025-05-07T20:32:54.5477173Z 2025-05-07T20:32:54.5477348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5477429Z 2025-05-07T20:32:54.5477521Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5477651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5477740Z x = x_sign * x_clamp 2025-05-07T20:32:54.5477823Z x0 = x[:, :D] 2025-05-07T20:32:54.5477909Z x1 = x[:, D:] 2025-05-07T20:32:54.5477981Z 2025-05-07T20:32:54.5478065Z if contiguous: 2025-05-07T20:32:54.5478162Z x0 = x0.contiguous() 2025-05-07T20:32:54.5478255Z x1 = x1.contiguous() 2025-05-07T20:32:54.5478332Z 2025-05-07T20:32:54.5478423Z if scale_ub is not None: 2025-05-07T20:32:54.5478543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5478707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5478799Z ) 2025-05-07T20:32:54.5478876Z else: 2025-05-07T20:32:54.5478977Z scale_ub_tensor = None 2025-05-07T20:32:54.5479049Z 2025-05-07T20:32:54.5479180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5479279Z op = silu_mul_quant 2025-05-07T20:32:54.5479364Z if compiled: 2025-05-07T20:32:54.5479463Z op = torch.compile(op) 2025-05-07T20:32:54.5479578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5479650Z 2025-05-07T20:32:54.5479748Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5479752Z 2025-05-07T20:32:54.5479850Z moe/activation_test.py:117: 2025-05-07T20:32:54.5480030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5480139Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5480240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5480762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5480864Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5481237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5481514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5481868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5481965Z kernel = self.compile( 2025-05-07T20:32:54.5482369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5482548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5482719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5482731Z 2025-05-07T20:32:54.5482942Z self = 2025-05-07T20:32:54.5483744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5484275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871da2520>} 2025-05-07T20:32:54.5485094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5485294Z context = 2025-05-07T20:32:54.5485298Z 2025-05-07T20:32:54.5485469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5485740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5485855Z module_map=module_map) 2025-05-07T20:32:54.5486019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5486120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5486200Z E ^ 2025-05-07T20:32:54.5486566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5486574Z 2025-05-07T20:32:54.5487009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5487016Z 2025-05-07T20:32:54.5487119Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5487349Z self=, 2025-05-07T20:32:54.5487438Z T=128, 2025-05-07T20:32:54.5487515Z D=5120, 2025-05-07T20:32:54.5487605Z scale_ub=None, 2025-05-07T20:32:54.5487690Z contiguous=True, 2025-05-07T20:32:54.5487776Z compiled=False, 2025-05-07T20:32:54.5487856Z ) 2025-05-07T20:32:54.5488081Z self = 2025-05-07T20:32:54.5488256Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5488260Z 2025-05-07T20:32:54.5488344Z @given( 2025-05-07T20:32:54.5488464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5488566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5488688Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5488851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5488978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5489057Z ) 2025-05-07T20:32:54.5489356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5489456Z def test_silu_mul_quant( 2025-05-07T20:32:54.5489533Z self, 2025-05-07T20:32:54.5489611Z T: int, 2025-05-07T20:32:54.5489696Z D: int, 2025-05-07T20:32:54.5489794Z scale_ub: Optional[float], 2025-05-07T20:32:54.5489949Z contiguous: bool, 2025-05-07T20:32:54.5490047Z compiled: bool, 2025-05-07T20:32:54.5490125Z ) -> None: 2025-05-07T20:32:54.5490219Z torch.manual_seed(2025) 2025-05-07T20:32:54.5490298Z 2025-05-07T20:32:54.5490473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5490553Z 2025-05-07T20:32:54.5490646Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5490770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5490869Z x = x_sign * x_clamp 2025-05-07T20:32:54.5490949Z x0 = x[:, :D] 2025-05-07T20:32:54.5491030Z x1 = x[:, D:] 2025-05-07T20:32:54.5491147Z 2025-05-07T20:32:54.5491233Z if contiguous: 2025-05-07T20:32:54.5491324Z x0 = x0.contiguous() 2025-05-07T20:32:54.5491420Z x1 = x1.contiguous() 2025-05-07T20:32:54.5491490Z 2025-05-07T20:32:54.5491580Z if scale_ub is not None: 2025-05-07T20:32:54.5491690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5491913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5491990Z ) 2025-05-07T20:32:54.5492073Z else: 2025-05-07T20:32:54.5492168Z scale_ub_tensor = None 2025-05-07T20:32:54.5492292Z 2025-05-07T20:32:54.5492423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5492513Z op = silu_mul_quant 2025-05-07T20:32:54.5492603Z if compiled: 2025-05-07T20:32:54.5492704Z op = torch.compile(op) 2025-05-07T20:32:54.5492810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5492888Z 2025-05-07T20:32:54.5492982Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5492987Z 2025-05-07T20:32:54.5493083Z moe/activation_test.py:117: 2025-05-07T20:32:54.5493220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5493319Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5493428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5493942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5494038Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5494416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5494645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5494996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5495102Z kernel = self.compile( 2025-05-07T20:32:54.5495499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5495683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5495812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5495819Z 2025-05-07T20:32:54.5496027Z self = 2025-05-07T20:32:54.5496840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5497411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871da3420>} 2025-05-07T20:32:54.5498200Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5498394Z context = 2025-05-07T20:32:54.5498437Z 2025-05-07T20:32:54.5498613Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5498884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5498995Z module_map=module_map) 2025-05-07T20:32:54.5499167Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5499266Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5499345Z E ^ 2025-05-07T20:32:54.5499720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5499725Z 2025-05-07T20:32:54.5500192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5500197Z 2025-05-07T20:32:54.5500307Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5500536Z self=, 2025-05-07T20:32:54.5500615Z T=128, 2025-05-07T20:32:54.5500699Z D=7168, 2025-05-07T20:32:54.5500781Z scale_ub=None, 2025-05-07T20:32:54.5500866Z contiguous=True, 2025-05-07T20:32:54.5500955Z compiled=False, 2025-05-07T20:32:54.5501072Z ) 2025-05-07T20:32:54.5501298Z self = 2025-05-07T20:32:54.5501477Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5501482Z 2025-05-07T20:32:54.5501562Z @given( 2025-05-07T20:32:54.5501688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5501792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5501906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5502030Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5502147Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5502221Z ) 2025-05-07T20:32:54.5502481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5502574Z def test_silu_mul_quant( 2025-05-07T20:32:54.5502655Z self, 2025-05-07T20:32:54.5502733Z T: int, 2025-05-07T20:32:54.5502813Z D: int, 2025-05-07T20:32:54.5502920Z scale_ub: Optional[float], 2025-05-07T20:32:54.5503010Z contiguous: bool, 2025-05-07T20:32:54.5503096Z compiled: bool, 2025-05-07T20:32:54.5503179Z ) -> None: 2025-05-07T20:32:54.5503277Z torch.manual_seed(2025) 2025-05-07T20:32:54.5503350Z 2025-05-07T20:32:54.5503526Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5503603Z 2025-05-07T20:32:54.5503695Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5503827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5503916Z x = x_sign * x_clamp 2025-05-07T20:32:54.5503997Z x0 = x[:, :D] 2025-05-07T20:32:54.5504085Z x1 = x[:, D:] 2025-05-07T20:32:54.5504159Z 2025-05-07T20:32:54.5504250Z if contiguous: 2025-05-07T20:32:54.5504341Z x0 = x0.contiguous() 2025-05-07T20:32:54.5504430Z x1 = x1.contiguous() 2025-05-07T20:32:54.5504512Z 2025-05-07T20:32:54.5504604Z if scale_ub is not None: 2025-05-07T20:32:54.5504709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5504852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5504927Z ) 2025-05-07T20:32:54.5505056Z else: 2025-05-07T20:32:54.5505160Z scale_ub_tensor = None 2025-05-07T20:32:54.5505233Z 2025-05-07T20:32:54.5505366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5505462Z op = silu_mul_quant 2025-05-07T20:32:54.5505546Z if compiled: 2025-05-07T20:32:54.5505653Z op = torch.compile(op) 2025-05-07T20:32:54.5505758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5505829Z 2025-05-07T20:32:54.5505969Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5505973Z 2025-05-07T20:32:54.5506069Z moe/activation_test.py:117: 2025-05-07T20:32:54.5506517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5506635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5506736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5507257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5507360Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5507866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5508101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5508482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5508592Z kernel = self.compile( 2025-05-07T20:32:54.5509006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5509184Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5509387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5509392Z 2025-05-07T20:32:54.5509604Z self = 2025-05-07T20:32:54.5510415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5510940Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd8719484a0>} 2025-05-07T20:32:54.5511718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5511918Z context = 2025-05-07T20:32:54.5511922Z 2025-05-07T20:32:54.5512089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5512363Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5512475Z module_map=module_map) 2025-05-07T20:32:54.5512642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5512746Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5512825Z E ^ 2025-05-07T20:32:54.5513191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5513198Z 2025-05-07T20:32:54.5513632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5513637Z 2025-05-07T20:32:54.5513740Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5513977Z self=, 2025-05-07T20:32:54.5514053Z T=2048, 2025-05-07T20:32:54.5514129Z D=7168, 2025-05-07T20:32:54.5514217Z scale_ub=1200.0, 2025-05-07T20:32:54.5514376Z contiguous=True, 2025-05-07T20:32:54.5514463Z compiled=False, 2025-05-07T20:32:54.5514542Z ) 2025-05-07T20:32:54.5514768Z self = 2025-05-07T20:32:54.5514947Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5514951Z 2025-05-07T20:32:54.5515034Z @given( 2025-05-07T20:32:54.5515155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5515253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5515444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5515560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5515679Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5515758Z ) 2025-05-07T20:32:54.5516011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5516111Z def test_silu_mul_quant( 2025-05-07T20:32:54.5516194Z self, 2025-05-07T20:32:54.5516272Z T: int, 2025-05-07T20:32:54.5516354Z D: int, 2025-05-07T20:32:54.5516450Z scale_ub: Optional[float], 2025-05-07T20:32:54.5516583Z contiguous: bool, 2025-05-07T20:32:54.5516677Z compiled: bool, 2025-05-07T20:32:54.5516775Z ) -> None: 2025-05-07T20:32:54.5516877Z torch.manual_seed(2025) 2025-05-07T20:32:54.5516951Z 2025-05-07T20:32:54.5517123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5519024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5519075Z 2025-05-07T20:32:54.5524646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5524657Z 2025-05-07T20:32:54.5524788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5525022Z self=, 2025-05-07T20:32:54.5525106Z T=1, 2025-05-07T20:32:54.5525184Z D=5120, 2025-05-07T20:32:54.5525269Z scale_ub=1200.0, 2025-05-07T20:32:54.5525364Z contiguous=True, 2025-05-07T20:32:54.5525450Z compiled=False, 2025-05-07T20:32:54.5525522Z ) 2025-05-07T20:32:54.5525756Z self = 2025-05-07T20:32:54.5525934Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5525938Z 2025-05-07T20:32:54.5526016Z @given( 2025-05-07T20:32:54.5526143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5526247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5526370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5526490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5526606Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5526688Z ) 2025-05-07T20:32:54.5526942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5527039Z def test_silu_mul_quant( 2025-05-07T20:32:54.5527124Z self, 2025-05-07T20:32:54.5527202Z T: int, 2025-05-07T20:32:54.5527281Z D: int, 2025-05-07T20:32:54.5527387Z scale_ub: Optional[float], 2025-05-07T20:32:54.5527478Z contiguous: bool, 2025-05-07T20:32:54.5527568Z compiled: bool, 2025-05-07T20:32:54.5527657Z ) -> None: 2025-05-07T20:32:54.5527753Z torch.manual_seed(2025) 2025-05-07T20:32:54.5527831Z 2025-05-07T20:32:54.5528114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5528191Z 2025-05-07T20:32:54.5528292Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5528427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5528519Z x = x_sign * x_clamp 2025-05-07T20:32:54.5528608Z x0 = x[:, :D] 2025-05-07T20:32:54.5528690Z x1 = x[:, D:] 2025-05-07T20:32:54.5528763Z 2025-05-07T20:32:54.5528857Z if contiguous: 2025-05-07T20:32:54.5528950Z x0 = x0.contiguous() 2025-05-07T20:32:54.5529084Z x1 = x1.contiguous() 2025-05-07T20:32:54.5529164Z 2025-05-07T20:32:54.5529258Z if scale_ub is not None: 2025-05-07T20:32:54.5529365Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5529514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5529593Z ) 2025-05-07T20:32:54.5529676Z else: 2025-05-07T20:32:54.5529771Z scale_ub_tensor = None 2025-05-07T20:32:54.5529848Z 2025-05-07T20:32:54.5529990Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5530081Z op = silu_mul_quant 2025-05-07T20:32:54.5530214Z if compiled: 2025-05-07T20:32:54.5530323Z op = torch.compile(op) 2025-05-07T20:32:54.5530429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5530503Z 2025-05-07T20:32:54.5530602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5530607Z 2025-05-07T20:32:54.5530705Z moe/activation_test.py:117: 2025-05-07T20:32:54.5530853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5530961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5531065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5531637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5531740Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5532192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.5532434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5532790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5532894Z kernel = self.compile( 2025-05-07T20:32:54.5533292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5533476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5533615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5533622Z 2025-05-07T20:32:54.5533835Z self = 2025-05-07T20:32:54.5534654Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5535176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871949a80>} 2025-05-07T20:32:54.5535948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5536149Z context = 2025-05-07T20:32:54.5536153Z 2025-05-07T20:32:54.5536324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5536601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5536754Z module_map=module_map) 2025-05-07T20:32:54.5536920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5537024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5537103Z E ^ 2025-05-07T20:32:54.5537470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5537481Z 2025-05-07T20:32:54.5537907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5537952Z 2025-05-07T20:32:54.5538056Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5538318Z self=, 2025-05-07T20:32:54.5538405Z T=2048, 2025-05-07T20:32:54.5538497Z D=5120, 2025-05-07T20:32:54.5538585Z scale_ub=None, 2025-05-07T20:32:54.5538670Z contiguous=True, 2025-05-07T20:32:54.5538754Z compiled=False, 2025-05-07T20:32:54.5538832Z ) 2025-05-07T20:32:54.5539057Z self = 2025-05-07T20:32:54.5539242Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5539288Z 2025-05-07T20:32:54.5539367Z @given( 2025-05-07T20:32:54.5539487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5539593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5539708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5539826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5539946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5540020Z ) 2025-05-07T20:32:54.5540278Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5540413Z def test_silu_mul_quant( 2025-05-07T20:32:54.5540490Z self, 2025-05-07T20:32:54.5540571Z T: int, 2025-05-07T20:32:54.5540646Z D: int, 2025-05-07T20:32:54.5540745Z scale_ub: Optional[float], 2025-05-07T20:32:54.5540838Z contiguous: bool, 2025-05-07T20:32:54.5540924Z compiled: bool, 2025-05-07T20:32:54.5541004Z ) -> None: 2025-05-07T20:32:54.5541104Z torch.manual_seed(2025) 2025-05-07T20:32:54.5541176Z 2025-05-07T20:32:54.5541345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5541425Z 2025-05-07T20:32:54.5541518Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.5543386Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5543398Z 2025-05-07T20:32:54.5543518Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.5543523Z 2025-05-07T20:32:54.5543634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5543864Z self=, 2025-05-07T20:32:54.5543942Z T=16384, 2025-05-07T20:32:54.5544023Z D=5120, 2025-05-07T20:32:54.5544105Z scale_ub=None, 2025-05-07T20:32:54.5544195Z contiguous=True, 2025-05-07T20:32:54.5544289Z compiled=False, 2025-05-07T20:32:54.5544361Z ) 2025-05-07T20:32:54.5544584Z self = 2025-05-07T20:32:54.5544771Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5544778Z 2025-05-07T20:32:54.5544855Z @given( 2025-05-07T20:32:54.5544982Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5545131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5545249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5545375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5545489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5545564Z ) 2025-05-07T20:32:54.5545824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5545919Z def test_silu_mul_quant( 2025-05-07T20:32:54.5545996Z self, 2025-05-07T20:32:54.5546120Z T: int, 2025-05-07T20:32:54.5546197Z D: int, 2025-05-07T20:32:54.5546295Z scale_ub: Optional[float], 2025-05-07T20:32:54.5546393Z contiguous: bool, 2025-05-07T20:32:54.5546479Z compiled: bool, 2025-05-07T20:32:54.5546567Z ) -> None: 2025-05-07T20:32:54.5546662Z torch.manual_seed(2025) 2025-05-07T20:32:54.5546734Z 2025-05-07T20:32:54.5546910Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5548810Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5548818Z 2025-05-07T20:32:54.5548962Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5548967Z 2025-05-07T20:32:54.5549134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5549372Z self=, 2025-05-07T20:32:54.5549456Z T=4096, 2025-05-07T20:32:54.5549532Z D=5120, 2025-05-07T20:32:54.5549621Z scale_ub=None, 2025-05-07T20:32:54.5549713Z contiguous=True, 2025-05-07T20:32:54.5549805Z compiled=False, 2025-05-07T20:32:54.5549885Z ) 2025-05-07T20:32:54.5550110Z self = 2025-05-07T20:32:54.5550286Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5550291Z 2025-05-07T20:32:54.5550377Z @given( 2025-05-07T20:32:54.5550495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5550597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5550720Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5550836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5550958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5551033Z ) 2025-05-07T20:32:54.5551285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5551387Z def test_silu_mul_quant( 2025-05-07T20:32:54.5551467Z self, 2025-05-07T20:32:54.5551543Z T: int, 2025-05-07T20:32:54.5551625Z D: int, 2025-05-07T20:32:54.5551724Z scale_ub: Optional[float], 2025-05-07T20:32:54.5551813Z contiguous: bool, 2025-05-07T20:32:54.5551904Z compiled: bool, 2025-05-07T20:32:54.5551981Z ) -> None: 2025-05-07T20:32:54.5552073Z torch.manual_seed(2025) 2025-05-07T20:32:54.5552151Z 2025-05-07T20:32:54.5552320Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5554229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5554238Z 2025-05-07T20:32:54.5554360Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5554365Z 2025-05-07T20:32:54.5554472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5554699Z self=, 2025-05-07T20:32:54.5554776Z T=2048, 2025-05-07T20:32:54.5554858Z D=5120, 2025-05-07T20:32:54.5554983Z scale_ub=None, 2025-05-07T20:32:54.5555068Z contiguous=False, 2025-05-07T20:32:54.5555158Z compiled=False, 2025-05-07T20:32:54.5555230Z ) 2025-05-07T20:32:54.5555452Z self = 2025-05-07T20:32:54.5555638Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5555645Z 2025-05-07T20:32:54.5555723Z @given( 2025-05-07T20:32:54.5555848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5555949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5556062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5556227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5556345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5556419Z ) 2025-05-07T20:32:54.5556678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5556775Z def test_silu_mul_quant( 2025-05-07T20:32:54.5556851Z self, 2025-05-07T20:32:54.5556931Z T: int, 2025-05-07T20:32:54.5557008Z D: int, 2025-05-07T20:32:54.5557107Z scale_ub: Optional[float], 2025-05-07T20:32:54.5557267Z contiguous: bool, 2025-05-07T20:32:54.5557353Z compiled: bool, 2025-05-07T20:32:54.5557436Z ) -> None: 2025-05-07T20:32:54.5557531Z torch.manual_seed(2025) 2025-05-07T20:32:54.5557604Z 2025-05-07T20:32:54.5557780Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5559631Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5559638Z 2025-05-07T20:32:54.5559761Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5559768Z 2025-05-07T20:32:54.5559871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5560097Z self=, 2025-05-07T20:32:54.5560183Z T=4096, 2025-05-07T20:32:54.5560259Z D=7168, 2025-05-07T20:32:54.5560345Z scale_ub=None, 2025-05-07T20:32:54.5560435Z contiguous=True, 2025-05-07T20:32:54.5560520Z compiled=True, 2025-05-07T20:32:54.5560599Z ) 2025-05-07T20:32:54.5560822Z self = 2025-05-07T20:32:54.5560993Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.5560998Z 2025-05-07T20:32:54.5561082Z @given( 2025-05-07T20:32:54.5561202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5561301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5561421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5561540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5561653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5561738Z ) 2025-05-07T20:32:54.5562034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5562136Z def test_silu_mul_quant( 2025-05-07T20:32:54.5562213Z self, 2025-05-07T20:32:54.5562292Z T: int, 2025-05-07T20:32:54.5562380Z D: int, 2025-05-07T20:32:54.5562479Z scale_ub: Optional[float], 2025-05-07T20:32:54.5562569Z contiguous: bool, 2025-05-07T20:32:54.5562663Z compiled: bool, 2025-05-07T20:32:54.5562742Z ) -> None: 2025-05-07T20:32:54.5562838Z torch.manual_seed(2025) 2025-05-07T20:32:54.5562961Z 2025-05-07T20:32:54.5563131Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5565030Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5565039Z 2025-05-07T20:32:54.5565158Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5565162Z 2025-05-07T20:32:54.5565270Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5565497Z self=, 2025-05-07T20:32:54.5565577Z T=2048, 2025-05-07T20:32:54.5565666Z D=5120, 2025-05-07T20:32:54.5565750Z scale_ub=1200.0, 2025-05-07T20:32:54.5565836Z contiguous=False, 2025-05-07T20:32:54.5565927Z compiled=False, 2025-05-07T20:32:54.5566045Z ) 2025-05-07T20:32:54.5566268Z self = 2025-05-07T20:32:54.5566456Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5566461Z 2025-05-07T20:32:54.5566541Z @given( 2025-05-07T20:32:54.5566667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5566765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5566881Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5567005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5567118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5567195Z ) 2025-05-07T20:32:54.5567453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5567548Z def test_silu_mul_quant( 2025-05-07T20:32:54.5567623Z self, 2025-05-07T20:32:54.5567704Z T: int, 2025-05-07T20:32:54.5567779Z D: int, 2025-05-07T20:32:54.5567878Z scale_ub: Optional[float], 2025-05-07T20:32:54.5567971Z contiguous: bool, 2025-05-07T20:32:54.5568056Z compiled: bool, 2025-05-07T20:32:54.5568138Z ) -> None: 2025-05-07T20:32:54.5568237Z torch.manual_seed(2025) 2025-05-07T20:32:54.5568310Z 2025-05-07T20:32:54.5568484Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5570331Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5570341Z 2025-05-07T20:32:54.5570464Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5570469Z 2025-05-07T20:32:54.5570569Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5570845Z self=, 2025-05-07T20:32:54.5570930Z T=4096, 2025-05-07T20:32:54.5571006Z D=7168, 2025-05-07T20:32:54.5571091Z scale_ub=1200.0, 2025-05-07T20:32:54.5571185Z contiguous=True, 2025-05-07T20:32:54.5571269Z compiled=False, 2025-05-07T20:32:54.5571348Z ) 2025-05-07T20:32:54.5571571Z self = 2025-05-07T20:32:54.5571748Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5571800Z 2025-05-07T20:32:54.5571957Z @given( 2025-05-07T20:32:54.5572076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5572174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5572296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5572412Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5572530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5572605Z ) 2025-05-07T20:32:54.5572860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5572959Z def test_silu_mul_quant( 2025-05-07T20:32:54.5573078Z self, 2025-05-07T20:32:54.5573157Z T: int, 2025-05-07T20:32:54.5573239Z D: int, 2025-05-07T20:32:54.5573336Z scale_ub: Optional[float], 2025-05-07T20:32:54.5573423Z contiguous: bool, 2025-05-07T20:32:54.5573517Z compiled: bool, 2025-05-07T20:32:54.5573594Z ) -> None: 2025-05-07T20:32:54.5573690Z torch.manual_seed(2025) 2025-05-07T20:32:54.5573770Z 2025-05-07T20:32:54.5573941Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5575796Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5575843Z 2025-05-07T20:32:54.5575967Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5575971Z 2025-05-07T20:32:54.5576079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5576310Z self=, 2025-05-07T20:32:54.5576391Z T=16384, 2025-05-07T20:32:54.5576474Z D=7168, 2025-05-07T20:32:54.5576556Z scale_ub=None, 2025-05-07T20:32:54.5576643Z contiguous=False, 2025-05-07T20:32:54.5576736Z compiled=True, 2025-05-07T20:32:54.5576809Z ) 2025-05-07T20:32:54.5577030Z self = 2025-05-07T20:32:54.5577219Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.5577224Z 2025-05-07T20:32:54.5577300Z @given( 2025-05-07T20:32:54.5577430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5577528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5577641Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5577763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5577876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5577951Z ) 2025-05-07T20:32:54.5578211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5578305Z def test_silu_mul_quant( 2025-05-07T20:32:54.5578381Z self, 2025-05-07T20:32:54.5578467Z T: int, 2025-05-07T20:32:54.5578543Z D: int, 2025-05-07T20:32:54.5578640Z scale_ub: Optional[float], 2025-05-07T20:32:54.5578736Z contiguous: bool, 2025-05-07T20:32:54.5578821Z compiled: bool, 2025-05-07T20:32:54.5578952Z ) -> None: 2025-05-07T20:32:54.5579048Z torch.manual_seed(2025) 2025-05-07T20:32:54.5579125Z 2025-05-07T20:32:54.5579303Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5581149Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5581196Z 2025-05-07T20:32:54.5581324Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5581328Z 2025-05-07T20:32:54.5581432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5581659Z self=, 2025-05-07T20:32:54.5581742Z T=4096, 2025-05-07T20:32:54.5581870Z D=7168, 2025-05-07T20:32:54.5581953Z scale_ub=None, 2025-05-07T20:32:54.5582046Z contiguous=True, 2025-05-07T20:32:54.5582133Z compiled=False, 2025-05-07T20:32:54.5582214Z ) 2025-05-07T20:32:54.5582436Z self = 2025-05-07T20:32:54.5582617Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5582622Z 2025-05-07T20:32:54.5582705Z @given( 2025-05-07T20:32:54.5582824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5582964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5583086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5583202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5583326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5583401Z ) 2025-05-07T20:32:54.5583651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5583755Z def test_silu_mul_quant( 2025-05-07T20:32:54.5583831Z self, 2025-05-07T20:32:54.5583907Z T: int, 2025-05-07T20:32:54.5583989Z D: int, 2025-05-07T20:32:54.5584086Z scale_ub: Optional[float], 2025-05-07T20:32:54.5584172Z contiguous: bool, 2025-05-07T20:32:54.5584266Z compiled: bool, 2025-05-07T20:32:54.5584344Z ) -> None: 2025-05-07T20:32:54.5584439Z torch.manual_seed(2025) 2025-05-07T20:32:54.5584519Z 2025-05-07T20:32:54.5584686Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5586545Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5586551Z 2025-05-07T20:32:54.5586668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5586673Z 2025-05-07T20:32:54.5586781Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5587008Z self=, 2025-05-07T20:32:54.5587085Z T=16384, 2025-05-07T20:32:54.5587167Z D=7168, 2025-05-07T20:32:54.5587262Z scale_ub=None, 2025-05-07T20:32:54.5587350Z contiguous=True, 2025-05-07T20:32:54.5587443Z compiled=False, 2025-05-07T20:32:54.5587515Z ) 2025-05-07T20:32:54.5587785Z self = 2025-05-07T20:32:54.5587976Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5587980Z 2025-05-07T20:32:54.5588060Z @given( 2025-05-07T20:32:54.5588182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5588287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5588400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5588517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5588676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5588751Z ) 2025-05-07T20:32:54.5589009Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5589107Z def test_silu_mul_quant( 2025-05-07T20:32:54.5589185Z self, 2025-05-07T20:32:54.5589268Z T: int, 2025-05-07T20:32:54.5589346Z D: int, 2025-05-07T20:32:54.5589444Z scale_ub: Optional[float], 2025-05-07T20:32:54.5589544Z contiguous: bool, 2025-05-07T20:32:54.5589631Z compiled: bool, 2025-05-07T20:32:54.5589708Z ) -> None: 2025-05-07T20:32:54.5589808Z torch.manual_seed(2025) 2025-05-07T20:32:54.5589949Z 2025-05-07T20:32:54.5590120Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5591975Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5592021Z 2025-05-07T20:32:54.5592141Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5592155Z 2025-05-07T20:32:54.5592257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5592484Z self=, 2025-05-07T20:32:54.5592569Z T=16384, 2025-05-07T20:32:54.5592648Z D=7168, 2025-05-07T20:32:54.5592732Z scale_ub=1200.0, 2025-05-07T20:32:54.5592827Z contiguous=True, 2025-05-07T20:32:54.5592911Z compiled=False, 2025-05-07T20:32:54.5592982Z ) 2025-05-07T20:32:54.5593214Z self = 2025-05-07T20:32:54.5593399Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5593404Z 2025-05-07T20:32:54.5593485Z @given( 2025-05-07T20:32:54.5593608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5593705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5593826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5593944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5594059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5594137Z ) 2025-05-07T20:32:54.5594391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5594485Z def test_silu_mul_quant( 2025-05-07T20:32:54.5594567Z self, 2025-05-07T20:32:54.5594642Z T: int, 2025-05-07T20:32:54.5594717Z D: int, 2025-05-07T20:32:54.5594823Z scale_ub: Optional[float], 2025-05-07T20:32:54.5594916Z contiguous: bool, 2025-05-07T20:32:54.5595006Z compiled: bool, 2025-05-07T20:32:54.5595082Z ) -> None: 2025-05-07T20:32:54.5595177Z torch.manual_seed(2025) 2025-05-07T20:32:54.5595257Z 2025-05-07T20:32:54.5595425Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5597323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5597336Z 2025-05-07T20:32:54.5597454Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5597495Z 2025-05-07T20:32:54.5597597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5597830Z self=, 2025-05-07T20:32:54.5597914Z T=128, 2025-05-07T20:32:54.5597990Z D=5120, 2025-05-07T20:32:54.5598083Z scale_ub=1200.0, 2025-05-07T20:32:54.5598169Z contiguous=False, 2025-05-07T20:32:54.5598260Z compiled=False, 2025-05-07T20:32:54.5598337Z ) 2025-05-07T20:32:54.5598561Z self = 2025-05-07T20:32:54.5598787Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5598793Z 2025-05-07T20:32:54.5598871Z @given( 2025-05-07T20:32:54.5598990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5599094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5599207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5599324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5599443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5599515Z ) 2025-05-07T20:32:54.5599772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5599906Z def test_silu_mul_quant( 2025-05-07T20:32:54.5599981Z self, 2025-05-07T20:32:54.5600062Z T: int, 2025-05-07T20:32:54.5600138Z D: int, 2025-05-07T20:32:54.5600238Z scale_ub: Optional[float], 2025-05-07T20:32:54.5600334Z contiguous: bool, 2025-05-07T20:32:54.5600418Z compiled: bool, 2025-05-07T20:32:54.5600498Z ) -> None: 2025-05-07T20:32:54.5600597Z torch.manual_seed(2025) 2025-05-07T20:32:54.5600670Z 2025-05-07T20:32:54.5600837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5600921Z 2025-05-07T20:32:54.5601012Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5601147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5601236Z x = x_sign * x_clamp 2025-05-07T20:32:54.5601314Z x0 = x[:, :D] 2025-05-07T20:32:54.5601403Z x1 = x[:, D:] 2025-05-07T20:32:54.5601478Z 2025-05-07T20:32:54.5601562Z if contiguous: 2025-05-07T20:32:54.5601662Z x0 = x0.contiguous() 2025-05-07T20:32:54.5601752Z x1 = x1.contiguous() 2025-05-07T20:32:54.5601825Z 2025-05-07T20:32:54.5601922Z if scale_ub is not None: 2025-05-07T20:32:54.5602029Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5602166Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5602250Z ) 2025-05-07T20:32:54.5602327Z else: 2025-05-07T20:32:54.5602421Z scale_ub_tensor = None 2025-05-07T20:32:54.5602497Z 2025-05-07T20:32:54.5602627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5602724Z op = silu_mul_quant 2025-05-07T20:32:54.5602810Z if compiled: 2025-05-07T20:32:54.5602913Z op = torch.compile(op) 2025-05-07T20:32:54.5603025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5603098Z 2025-05-07T20:32:54.5603193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5603197Z 2025-05-07T20:32:54.5603300Z moe/activation_test.py:117: 2025-05-07T20:32:54.5603433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5603586Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5603693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5604215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5604318Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5604689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.5604921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5605322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5605420Z kernel = self.compile( 2025-05-07T20:32:54.5605822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5606004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5606390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5606396Z 2025-05-07T20:32:54.5606769Z self = 2025-05-07T20:32:54.5607577Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5608109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd8718307c0>} 2025-05-07T20:32:54.5608910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5609196Z context = 2025-05-07T20:32:54.5609201Z 2025-05-07T20:32:54.5609376Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5609654Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5609766Z module_map=module_map) 2025-05-07T20:32:54.5609931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5610030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5610116Z E ^ 2025-05-07T20:32:54.5610484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5610489Z 2025-05-07T20:32:54.5610921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5610931Z 2025-05-07T20:32:54.5611035Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5611266Z self=, 2025-05-07T20:32:54.5611354Z T=2048, 2025-05-07T20:32:54.5611430Z D=7168, 2025-05-07T20:32:54.5611515Z scale_ub=None, 2025-05-07T20:32:54.5611606Z contiguous=False, 2025-05-07T20:32:54.5611690Z compiled=False, 2025-05-07T20:32:54.5611762Z ) 2025-05-07T20:32:54.5612044Z self = 2025-05-07T20:32:54.5612223Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5612231Z 2025-05-07T20:32:54.5612317Z @given( 2025-05-07T20:32:54.5612436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5612533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5612657Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5612775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5612889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5613041Z ) 2025-05-07T20:32:54.5613295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5613390Z def test_silu_mul_quant( 2025-05-07T20:32:54.5613476Z self, 2025-05-07T20:32:54.5613551Z T: int, 2025-05-07T20:32:54.5613628Z D: int, 2025-05-07T20:32:54.5613732Z scale_ub: Optional[float], 2025-05-07T20:32:54.5613822Z contiguous: bool, 2025-05-07T20:32:54.5613913Z compiled: bool, 2025-05-07T20:32:54.5614155Z ) -> None: 2025-05-07T20:32:54.5614249Z torch.manual_seed(2025) 2025-05-07T20:32:54.5614329Z 2025-05-07T20:32:54.5614498Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5616405Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5616418Z 2025-05-07T20:32:54.5616538Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5616542Z 2025-05-07T20:32:54.5616644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5616878Z self=, 2025-05-07T20:32:54.5616956Z T=128, 2025-05-07T20:32:54.5617035Z D=7168, 2025-05-07T20:32:54.5617128Z scale_ub=1200.0, 2025-05-07T20:32:54.5617253Z contiguous=True, 2025-05-07T20:32:54.5617342Z compiled=True, 2025-05-07T20:32:54.5617416Z ) 2025-05-07T20:32:54.5617638Z self = 2025-05-07T20:32:54.5617818Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5617823Z 2025-05-07T20:32:54.5617901Z @given( 2025-05-07T20:32:54.5618020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5618125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5618239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5618356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5618501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5618584Z ) 2025-05-07T20:32:54.5618864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5618962Z def test_silu_mul_quant( 2025-05-07T20:32:54.5619041Z self, 2025-05-07T20:32:54.5619124Z T: int, 2025-05-07T20:32:54.5619201Z D: int, 2025-05-07T20:32:54.5619300Z scale_ub: Optional[float], 2025-05-07T20:32:54.5619400Z contiguous: bool, 2025-05-07T20:32:54.5619488Z compiled: bool, 2025-05-07T20:32:54.5619570Z ) -> None: 2025-05-07T20:32:54.5619673Z torch.manual_seed(2025) 2025-05-07T20:32:54.5619745Z 2025-05-07T20:32:54.5619918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5620000Z 2025-05-07T20:32:54.5620093Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5620226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5620315Z x = x_sign * x_clamp 2025-05-07T20:32:54.5620396Z x0 = x[:, :D] 2025-05-07T20:32:54.5620483Z x1 = x[:, D:] 2025-05-07T20:32:54.5620556Z 2025-05-07T20:32:54.5620638Z if contiguous: 2025-05-07T20:32:54.5620736Z x0 = x0.contiguous() 2025-05-07T20:32:54.5620828Z x1 = x1.contiguous() 2025-05-07T20:32:54.5620900Z 2025-05-07T20:32:54.5620999Z if scale_ub is not None: 2025-05-07T20:32:54.5621105Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5621289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5621376Z ) 2025-05-07T20:32:54.5621452Z else: 2025-05-07T20:32:54.5621557Z scale_ub_tensor = None 2025-05-07T20:32:54.5621630Z 2025-05-07T20:32:54.5621760Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5621856Z op = silu_mul_quant 2025-05-07T20:32:54.5621943Z if compiled: 2025-05-07T20:32:54.5622042Z op = torch.compile(op) 2025-05-07T20:32:54.5622219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5622292Z 2025-05-07T20:32:54.5622382Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5622386Z 2025-05-07T20:32:54.5622486Z moe/activation_test.py:117: 2025-05-07T20:32:54.5622622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5622729Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5622829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5623214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5623313Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5623870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5623969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5624343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.5624575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5624929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5625065Z kernel = self.compile( 2025-05-07T20:32:54.5625460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5625650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5625783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5625788Z 2025-05-07T20:32:54.5625997Z self = 2025-05-07T20:32:54.5626811Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5627338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd871831940>} 2025-05-07T20:32:54.5628123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5628315Z context = 2025-05-07T20:32:54.5628320Z 2025-05-07T20:32:54.5628495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5628766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5628874Z module_map=module_map) 2025-05-07T20:32:54.5629045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5629145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5629223Z E ^ 2025-05-07T20:32:54.5629598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5629605Z 2025-05-07T20:32:54.5630033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5630038Z 2025-05-07T20:32:54.5630192Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5630423Z self=, 2025-05-07T20:32:54.5630502Z T=128, 2025-05-07T20:32:54.5630585Z D=7168, 2025-05-07T20:32:54.5630668Z scale_ub=1200.0, 2025-05-07T20:32:54.5630754Z contiguous=True, 2025-05-07T20:32:54.5630845Z compiled=False, 2025-05-07T20:32:54.5630916Z ) 2025-05-07T20:32:54.5631146Z self = 2025-05-07T20:32:54.5631362Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.5631366Z 2025-05-07T20:32:54.5631444Z @given( 2025-05-07T20:32:54.5631570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5631674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5631789Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5631911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5632028Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5632102Z ) 2025-05-07T20:32:54.5632400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5632496Z def test_silu_mul_quant( 2025-05-07T20:32:54.5632580Z self, 2025-05-07T20:32:54.5632656Z T: int, 2025-05-07T20:32:54.5632733Z D: int, 2025-05-07T20:32:54.5632836Z scale_ub: Optional[float], 2025-05-07T20:32:54.5632927Z contiguous: bool, 2025-05-07T20:32:54.5633014Z compiled: bool, 2025-05-07T20:32:54.5633098Z ) -> None: 2025-05-07T20:32:54.5633193Z torch.manual_seed(2025) 2025-05-07T20:32:54.5633266Z 2025-05-07T20:32:54.5633443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5633561Z 2025-05-07T20:32:54.5633654Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5633787Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5635651Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5635665Z 2025-05-07T20:32:54.5635785Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5635790Z 2025-05-07T20:32:54.5635892Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5636129Z self=, 2025-05-07T20:32:54.5636206Z T=128, 2025-05-07T20:32:54.5636284Z D=5120, 2025-05-07T20:32:54.5636376Z scale_ub=1200.0, 2025-05-07T20:32:54.5636464Z contiguous=True, 2025-05-07T20:32:54.5636548Z compiled=True, 2025-05-07T20:32:54.5636629Z ) 2025-05-07T20:32:54.5636856Z self = 2025-05-07T20:32:54.5637027Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5637041Z 2025-05-07T20:32:54.5637117Z @given( 2025-05-07T20:32:54.5637238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5637344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5637459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5637576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5637696Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5637772Z ) 2025-05-07T20:32:54.5638026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5638128Z def test_silu_mul_quant( 2025-05-07T20:32:54.5638248Z self, 2025-05-07T20:32:54.5638327Z T: int, 2025-05-07T20:32:54.5638411Z D: int, 2025-05-07T20:32:54.5638512Z scale_ub: Optional[float], 2025-05-07T20:32:54.5638610Z contiguous: bool, 2025-05-07T20:32:54.5638697Z compiled: bool, 2025-05-07T20:32:54.5638775Z ) -> None: 2025-05-07T20:32:54.5638879Z torch.manual_seed(2025) 2025-05-07T20:32:54.5638972Z 2025-05-07T20:32:54.5639169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5639297Z 2025-05-07T20:32:54.5639392Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5639515Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5641414Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5641422Z 2025-05-07T20:32:54.5641540Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.5641545Z 2025-05-07T20:32:54.5641652Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5641880Z self=, 2025-05-07T20:32:54.5641966Z T=128, 2025-05-07T20:32:54.5642046Z D=7168, 2025-05-07T20:32:54.5642129Z scale_ub=None, 2025-05-07T20:32:54.5642219Z contiguous=True, 2025-05-07T20:32:54.5642341Z compiled=True, 2025-05-07T20:32:54.5642414Z ) 2025-05-07T20:32:54.5642644Z self = 2025-05-07T20:32:54.5642816Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.5642821Z 2025-05-07T20:32:54.5642898Z @given( 2025-05-07T20:32:54.5643021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5643120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5643240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5643357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5643472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5643557Z ) 2025-05-07T20:32:54.5643808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5643902Z def test_silu_mul_quant( 2025-05-07T20:32:54.5643986Z self, 2025-05-07T20:32:54.5644065Z T: int, 2025-05-07T20:32:54.5644141Z D: int, 2025-05-07T20:32:54.5644246Z scale_ub: Optional[float], 2025-05-07T20:32:54.5644335Z contiguous: bool, 2025-05-07T20:32:54.5644421Z compiled: bool, 2025-05-07T20:32:54.5644506Z ) -> None: 2025-05-07T20:32:54.5644600Z torch.manual_seed(2025) 2025-05-07T20:32:54.5644683Z 2025-05-07T20:32:54.5644854Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5646692Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5646711Z 2025-05-07T20:32:54.5646828Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5647014Z =============================== warnings summary =============================== 2025-05-07T20:32:54.5647343Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.5647658Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.5647966Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.5648928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:54.5649203Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:54.5649210Z 2025-05-07T20:32:54.5649436Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:54.5649608Z ================= 1 failed, 1 deselected, 3 warnings in 13.16s ================= 2025-05-07T20:32:56.1518099Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:56.2147861Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:56.2148534Z 2025-05-07T20:32:56.2149195Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:56.2150077Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:56.2150665Z 2025-05-07T20:32:56.2150671Z 2025-05-07T20:32:56.2150677Z 2025-05-07T20:32:56.2168201Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:56.2250004Z Post job cleanup. 2025-05-07T20:32:56.3222282Z [command]/usr/bin/git version 2025-05-07T20:32:56.3262061Z git version 2.47.1 2025-05-07T20:32:56.3296423Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/7d47374c-685e-4702-832f-41fd22dfa44f/.gitconfig' 2025-05-07T20:32:56.3306272Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/7d47374c-685e-4702-832f-41fd22dfa44f' before making global git config changes 2025-05-07T20:32:56.3307145Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:56.3318994Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:56.3360968Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:56.3396002Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:56.3734149Z Entering 'external/asmjit' 2025-05-07T20:32:56.3799125Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.3877152Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.3944950Z Entering 'external/cutlass' 2025-05-07T20:32:56.4020129Z Entering 'external/googletest' 2025-05-07T20:32:56.4086147Z Entering 'external/hipify_torch' 2025-05-07T20:32:56.4153550Z Entering 'external/json' 2025-05-07T20:32:56.4238847Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:56.4265069Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4277956Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:56.4309956Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:56.4637364Z Entering 'external/asmjit' 2025-05-07T20:32:56.4681871Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4723252Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.4767077Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4816568Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.4861911Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4904074Z Entering 'external/cutlass' 2025-05-07T20:32:56.4947086Z http.https://github.com/.extraheader 2025-05-07T20:32:56.4998304Z 
Entering 'external/googletest' 2025-05-07T20:32:56.5046498Z http.https://github.com/.extraheader 2025-05-07T20:32:56.5088775Z Entering 'external/hipify_torch' 2025-05-07T20:32:56.5132292Z http.https://github.com/.extraheader 2025-05-07T20:32:56.5173935Z Entering 'external/json' 2025-05-07T20:32:56.5217205Z http.https://github.com/.extraheader 2025-05-07T20:32:56.5367835Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:32:56.5404098Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:32:56.5415515Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:32:56.5415886Z ##[endgroup] 2025-05-07T20:32:56.5516149Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:07.2814744Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:23.7199544Z Cleaning up orphan processes
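
Note on the recurring CompilationError: every ValueError("type fp8e4nv not supported in this architecture") above comes from Triton refusing to compile the fp8e4nv (e4m3) dtype on this runner's GPU. Triton supports fp8e4nv natively only on compute capability 8.9 and newer (Ada/Hopper class parts); a 22.07 GiB device that offers only ('fp8e4b15', 'fp8e5') is consistent with an sm_86 part such as the A10G. A minimal sketch of a capability guard that would skip these examples on unsupported hardware follows; the helper name and test-class name are illustrative assumptions, not taken from the repo:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) kernels need compute capability >= 8.9; on sm_86
        # devices Triton only offers the fp8e4b15/fp8e5 variants, which is
        # exactly the error seen in the log above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTest(unittest.TestCase):  # hypothetical stand-in for the real test class
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...  # the silu_mul_quant example body from the log would run here

With a guard like this, the Hypothesis examples would be skipped on sm_86 runners instead of failing at Triton compile time.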
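Note on the OutOfMemoryError cascade: each failing example tries a modest allocation (20 MiB to 448 MiB) while the 22.07 GiB device is already almost entirely consumed by PyTorch (about 21.7 GiB allocated), so the failures look like memory accumulating across Hypothesis examples, likely left behind by the earlier compile failures, rather than any single oversized tensor. The allocator's own suggestion is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which only takes effect if set before CUDA is initialized. A rough sketch, under the assumption that explicitly releasing cached memory between examples would keep later examples from starving (the function name is illustrative):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then return cached allocator
        # blocks to the driver so the next example starts from a cleaner
        # device; synchronize so pending frees have actually landed.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # The allocator hint quoted in the error text must be set before the
    # first CUDA allocation, e.g. in the job's environment rather than
    # inside the already-running test process:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True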